Deep Learning

Ian Goodfellow, Yoshua Bengio and Aaron Courville
Contents

Website
Acknowledgments
Notation

1 Introduction
   1.1 Who Should Read This Book?
   1.2 Historical Trends in Deep Learning

I Applied Math and Machine Learning Basics

2 Linear Algebra
   2.1 Scalars, Vectors, Matrices and Tensors
   2.2 Multiplying Matrices and Vectors
   2.3 Identity and Inverse Matrices
   2.4 Linear Dependence and Span
   2.5 Norms
   2.6 Special Kinds of Matrices and Vectors
   2.7 Eigendecomposition
   2.8 Singular Value Decomposition
   2.9 The Moore-Penrose Pseudoinverse
   2.10 The Trace Operator
   2.11 The Determinant
   2.12 Example: Principal Components Analysis

3 Probability and Information Theory
   3.1 Why Probability?
   3.2 Random Variables
   3.3 Probability Distributions
   3.4 Marginal Probability
   3.5 Conditional Probability
   3.6 The Chain Rule of Conditional Probabilities
   3.7 Independence and Conditional Independence
   3.8 Expectation, Variance and Covariance
   3.9 Common Probability Distributions
   3.10 Useful Properties of Common Functions
   3.11 Bayes’ Rule
   3.12 Technical Details of Continuous Variables
   3.13 Information Theory
   3.14 Structured Probabilistic Models

4 Numerical Computation
   4.1 Overflow and Underflow
   4.2 Poor Conditioning
   4.3 Gradient-Based Optimization
   4.4 Constrained Optimization
   4.5 Example: Linear Least Squares

5 Machine Learning Basics
   5.1 Learning Algorithms
   5.2 Capacity, Overfitting and Underfitting
   5.3 Hyperparameters and Validation Sets
   5.4 Estimators, Bias and Variance
   5.5 Maximum Likelihood Estimation
   5.6 Bayesian Statistics
   5.7 Supervised Learning Algorithms
   5.8 Unsupervised Learning Algorithms
   5.9 Stochastic Gradient Descent
   5.10 Building a Machine Learning Algorithm
   5.11 Challenges Motivating Deep Learning

II Deep Networks: Modern Practices

6 Deep Feedforward Networks
   6.1 Example: Learning XOR
   6.2 Gradient-Based Learning
   6.3 Hidden Units
   6.4 Architecture Design
   6.5 Back-Propagation and Other Differentiation Algorithms
   6.6 Historical Notes

7 Regularization for Deep Learning
   7.1 Parameter Norm Penalties
   7.2 Norm Penalties as Constrained Optimization
   7.3 Regularization and Under-Constrained Problems
   7.4 Dataset Augmentation
   7.5 Noise Robustness
   7.6 Semi-Supervised Learning
   7.7 Multi-Task Learning
   7.8 Early Stopping
   7.9 Parameter Tying and Parameter Sharing
   7.10 Sparse Representations
   7.11 Bagging and Other Ensemble Methods
   7.12 Dropout
   7.13 Adversarial Training
   7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

8 Optimization for Training Deep Models
   8.1 How Learning Differs from Pure Optimization
   8.2 Challenges in Neural Network Optimization
   8.3 Basic Algorithms
   8.4 Parameter Initialization Strategies
   8.5 Algorithms with Adaptive Learning Rates
   8.6 Approximate Second-Order Methods
   8.7 Optimization Strategies and Meta-Algorithms

9 Convolutional Networks
   9.1 The Convolution Operation
   9.2 Motivation
   9.3 Pooling
   9.4 Convolution and Pooling as an Infinitely Strong Prior
   9.5 Variants of the Basic Convolution Function
   9.6 Structured Outputs
   9.7 Data Types
   9.8 Efficient Convolution Algorithms
   9.9 Random or Unsupervised Features
   9.10 The Neuroscientific Basis for Convolutional Networks
   9.11 Convolutional Networks and the History of Deep Learning

10 Sequence Modeling: Recurrent and Recursive Nets
   10.1 Unfolding Computational Graphs
   10.2 Recurrent Neural Networks
   10.3 Bidirectional RNNs
   10.4 Encoder-Decoder Sequence-to-Sequence Architectures
   10.5 Deep Recurrent Networks
   10.6 Recursive Neural Networks
   10.7 The Challenge of Long-Term Dependencies
   10.8 Echo State Networks
   10.9 Leaky Units and Other Strategies for Multiple Time Scales
   10.10 The Long Short-Term Memory and Other Gated RNNs
   10.11 Optimization for Long-Term Dependencies
   10.12 Explicit Memory

11 Practical Methodology
   11.1 Performance Metrics
   11.2 Default Baseline Models
   11.3 Determining Whether to Gather More Data
   11.4 Selecting Hyperparameters
   11.5 Debugging Strategies
   11.6 Example: Multi-Digit Number Recognition

12 Applications
   12.1 Large Scale Deep Learning
   12.2 Computer Vision
   12.3 Speech Recognition
   12.4 Natural Language Processing
   12.5 Other Applications

III Deep Learning Research

13 Linear Factor Models
   13.1 Probabilistic PCA and Factor Analysis
   13.2 Independent Component Analysis (ICA)
   13.3 Slow Feature Analysis
   13.4 Sparse Coding
   13.5 Manifold Interpretation of PCA

14 Autoencoders
   14.1 Undercomplete Autoencoders
   14.2 Regularized Autoencoders
   14.3 Representational Power, Layer Size and Depth
   14.4 Stochastic Encoders and Decoders
   14.5 Denoising Autoencoders
   14.6 Learning Manifolds with Autoencoders
   14.7 Contractive Autoencoders
   14.8 Predictive Sparse Decomposition
   14.9 Applications of Autoencoders

15 Representation Learning
   15.1 Greedy Layer-Wise Unsupervised Pretraining
   15.2 Transfer Learning and Domain Adaptation
   15.3 Semi-Supervised Disentangling of Causal Factors
   15.4 Distributed Representation
   15.5 Exponential Gains from Depth
   15.6 Providing Clues to Discover Underlying Causes

16 Structured Probabilistic Models for Deep Learning
   16.1 The Challenge of Unstructured Modeling
   16.2 Using Graphs to Describe Model Structure
   16.3 Sampling from Graphical Models
   16.4 Advantages of Structured Modeling
   16.5 Learning about Dependencies
   16.6 Inference and Approximate Inference
   16.7 The Deep Learning Approach to Structured Probabilistic Models

17 Monte Carlo Methods
   17.1 Sampling and Monte Carlo Methods
   17.2 Importance Sampling
   17.3 Markov Chain Monte Carlo Methods
   17.4 Gibbs Sampling
   17.5 The Challenge of Mixing between Separated Modes

18 Confronting the Partition Function
   18.1 The Log-Likelihood Gradient
   18.2 Stochastic Maximum Likelihood and Contrastive Divergence
   18.3 Pseudolikelihood
   18.4 Score Matching and Ratio Matching
   18.5 Denoising Score Matching
   18.6 Noise-Contrastive Estimation
   18.7 Estimating the Partition Function

19 Approximate Inference
   19.1 Inference as Optimization
   19.2 Expectation Maximization
   19.3 MAP Inference and Sparse Coding
   19.4 Variational Inference and Learning
   19.5 Learned Approximate Inference

20 Deep Generative Models
   20.1 Boltzmann Machines
   20.2 Restricted Boltzmann Machines
   20.3 Deep Belief Networks
   20.4 Deep Boltzmann Machines
   20.5 Boltzmann Machines for Real-Valued Data
   20.6 Convolutional Boltzmann Machines
   20.7 Boltzmann Machines for Structured or Sequential Outputs
   20.8 Other Boltzmann Machines
   20.9 Back-Propagation through Random Operations
   20.10 Directed Generative Nets
   20.11 Drawing Samples from Autoencoders
   20.12 Generative Stochastic Networks
   20.13 Other Generation Schemes
   20.14 Evaluating Generative Models
   20.15 Conclusion

Bibliography

Index
Website

www.deeplearningbook.org

This book is accompanied by the above website. The website provides a variety of supplementary material, including exercises, lecture slides, corrections of mistakes, and other resources that should be useful to both readers and instructors.
Ackno knowledgmen wledgmen wledgments ts Acknowledgments
This book would not have been possible without the contributions of many people.

We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.

We would like to thank the people who offered feedback on the content of the book itself. Some offered feedback on many chapters: Martín Abadi, Guillaume Alain, Ion Androutsopoulos, Fred Bertsch, Olexa Bilaniuk, Ufuk Can Biçici, Matko Bošnjak, John Boersma, Greg Brockman, Pierre Luc Carrier, Sarath Chandar, Pawel Chilinski, Mark Daoust, Oleg Dashevskii, Laurent Dinh, Stephan Dreseitl, Jim Fan, Miao Fan, Meire Fortunato, Frédéric Francis, Nando de Freitas, Çağlar Gülçehre, Jurgen Van Gael, Javier Alonso García, Jonathan Hunt, Gopi Jeyaram, Chingiz Kabytayev, Lukasz Kaiser, Varun Kanade, Akiel Khan, John King, Diederik P. Kingma, Yann LeCun, Rudolf Mathey, Matías Mattamala, Abhinav Maurya, Kevin Murphy, Oleg Mürk, Roman Novak, Augustus Q. Odena, Simon Pavlik, Karl Pichotta, Kari Pulli, Tapani Raiko, Anurag Ranjan, Johannes Roith, Halis Sak, César Salgado, Grigory Sapunov, Mike Schuster, Julian Serban, Nir Shabat, Ken Shirriff, Scott Stanley, David Sussillo, Ilya Sutskever, Carles Gelada Sáez, Graham Taylor, Valentin Tolmer, An Tran, Shubhendu Trivedi, Alexey Umnov, Vincent Vanhoucke, Marco Visentini-Scarzanella, David Warde-Farley, Dustin Webb, Kelvin Xu, Wei Xue, Li Yao, Zygmunt Zając and Ozan Çağlayan.

We would also like to thank those who provided us with useful feedback on individual chapters:

• Chapter 1, Introduction: Yusuf Akgul, Sebastien Bratieres, Samira Ebrahimi, Charlie Gorichanaz, Brendan Loudermilk, Eric Morris, Cosmin Pârvulescu and Alfredo Solano.

• Chapter 2, Linear Algebra: Amjad Almahairi, Nikola Banić, Kevin Bennett,
Philippe Castonguay, Oscar Chang, Eric Fosler-Lussier, Andrey Khalyavin, Sergey Oreshkov, István Petrás, Dennis Prangle, Thomas Rohée, Colby Toland, Massimiliano Tomassoli, Alessandro Vitale and Bob Welland.

• Chapter 3, Probability and Information Theory: John Philip Anderson, Kai Arulkumaran, Vincent Dumoulin, Rui Fa, Stephan Gouws, Artem Oboturov, Antti Rasmus, Andre Simpelo, Alexey Surkov and Volker Tresp.

• Chapter 4, Numerical Computation: Tran Lam An, Ian Fischer, and Hu Yuhuang.

• Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Kee-Bong Song, Zheng Sun and Andy Wu.

• Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.

• Chapter 7, Regularization for Deep Learning: Inkyu Lee, Sunil Mohan and Joshua Salisbury.

• Chapter 8, Optimization for Training Deep Models: Marcel Ackermann, Rowel Atienza, Andrew Brock, Tegan Maharaj, James Martens and Klaus Strobl.

• Chapter 9, Convolutional Networks: Martín Arjovsky, Eugene Brevdo, Eric Jensen, Asifullah Khan, Mehdi Mirza, Alex Paino, Eddie Pierce, Marjorie Sayer, Ryan Stout and Wentao Wu.

• Chapter 10, Sequence Modeling: Recurrent and Recursive Nets: Gökçen Eraslan, Steven Hickson, Razvan Pascanu, Lorenzo von Ritter, Rui Rodrigues, Mihaela Rosca, Dmitriy Serdyuk, Dongyu Shi and Kaiyu Yang.

• Chapter 11, Practical Methodology: Daniel Beckstein.

• Chapter 12, Applications: George Dahl and Ribana Roscher.

• Chapter 15, Representation Learning: Kunal Ghosh.

• Chapter 16, Structured Probabilistic Models for Deep Learning: Minh Lê and Anton Varfolom.

• Chapter 18, Confronting the Partition Function: Sam Bowman.
• Chapter 20, Deep Generative Models: Nicolas Chapados, Daniel Galvez, Wenming Ma, Fady Medhat, Shakir Mohamed and Grégoire Montavon.

• Bibliography: Leslie N. Smith.

We also want to thank those who allowed us to reproduce images, figures or data from their publications. We indicate their contributions in the figure captions throughout the text.

We would like to thank Ian's wife Daniela Flori Goodfellow for patiently supporting Ian during the writing of the book as well as for help with proofreading.

We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian's former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project.

Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.
Notation
This section provides a concise reference describing the notation used throughout this book. If you are unfamiliar with any of the corresponding mathematical concepts, this notation reference may seem intimidating. However, do not despair; we describe most of these ideas in chapters 2-4.

Numbers and Arrays

a                A scalar (integer or real)
a                A vector
A                A matrix
A                A tensor
I_n              Identity matrix with n rows and n columns
I                Identity matrix with dimensionality implied by context
e^(i)            Standard basis vector [0, ..., 0, 1, 0, ..., 0] with a 1 at position i
diag(a)          A square, diagonal matrix with diagonal entries given by a
a                A scalar random variable
a                A vector-valued random variable
A                A matrix-valued random variable
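To make a few of these array notations concrete, here is a minimal NumPy sketch (NumPy is our illustration choice here, not something the notation itself prescribes):

```python
import numpy as np

n = 3
I_n = np.eye(n)                 # I_n: identity matrix with n rows and n columns
e_2 = np.eye(n)[1]              # e^(2): standard basis vector with a 1 at position 2
a = np.array([1.0, 2.0, 3.0])
D = np.diag(a)                  # diag(a): square diagonal matrix with entries given by a

# Multiplying by diag(a) scales coordinate i by a_i:
assert np.allclose(D @ np.ones(n), a)
assert e_2.tolist() == [0.0, 1.0, 0.0]
```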
Sets and Graphs

A                A set
R                The set of real numbers
{0, 1}           The set containing 0 and 1
{0, 1, ..., n}   The set of all integers between 0 and n
[a, b]           The real interval including a and b
(a, b]           The real interval excluding a but including b
A \ B            Set subtraction, i.e., the set containing the elements of A that are not in B
G                A graph
Pa_G(x_i)        The parents of x_i in G
Indexing

a_i              Element i of vector a, with indexing starting at 1
a_{-i}           All elements of vector a except for element i
A_{i,j}          Element i, j of matrix A
A_{i,:}          Row i of matrix A
A_{:,i}          Column i of matrix A
A_{i,j,k}        Element (i, j, k) of a 3-D tensor A
A_{:,:,i}        2-D slice of a 3-D tensor
a_i              Element i of the random vector a
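These one-based conventions line up with zero-based NumPy indexing; the sketch below (an illustrative aside, not from the book) shows the correspondence:

```python
import numpy as np

# The book indexes from 1; NumPy indexes from 0, so the book's A_{i,j}
# corresponds to A[i-1, j-1] below.
A = np.arange(1, 13).reshape(3, 4)           # a 3 x 4 matrix

row_1 = A[0, :]                              # A_{1,:}: row 1 of A
col_2 = A[:, 1]                              # A_{:,2}: column 2 of A
a = np.array([10, 20, 30])
a_minus_1 = np.delete(a, 0)                  # a_{-1}: all elements except element 1

assert A[0, 1] == 2                          # A_{1,2}
assert col_2.tolist() == [2, 6, 10]
assert a_minus_1.tolist() == [20, 30]
```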
Linear Algebra Operations

A^T              Transpose of matrix A
A^+              Moore-Penrose pseudoinverse of A
A ⊙ B            Element-wise (Hadamard) product of A and B
det(A)           Determinant of A
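In NumPy these operations look as follows (a sketch for illustration; note that `*` is the element-wise product, not matrix multiplication):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

A_T = A.T                        # A^T: transpose of A
A_pinv = np.linalg.pinv(A)       # A^+: Moore-Penrose pseudoinverse of A
H = A * B                        # A ⊙ B: element-wise (Hadamard) product
d = np.linalg.det(A)             # det(A): determinant of A

# For an invertible matrix, the pseudoinverse coincides with the inverse:
assert np.allclose(A_pinv, np.linalg.inv(A))
assert np.isclose(d, -2.0)       # det(A) = 1*4 - 2*3 = -2
```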
Calculus

dy/dx                    Derivative of y with respect to x
∂y/∂x                    Partial derivative of y with respect to x
∇_x y                    Gradient of y with respect to x
∇_X y                    Matrix derivatives of y with respect to X
∇_X y                    Tensor containing derivatives of y with respect to X
∂f/∂x                    Jacobian matrix J ∈ R^(m×n) of f : R^n → R^m
∇²_x f(x) or H(f)(x)     The Hessian matrix of f at input point x
∫ f(x) dx                Definite integral over the entire domain of x
∫_S f(x) dx              Definite integral with respect to x over the set S

Probability and Information Theory

a ⊥ b                    The random variables a and b are independent
a ⊥ b | c                They are conditionally independent given c
P(a)                     A probability distribution over a discrete variable
p(a)                     A probability distribution over a continuous variable, or over a variable whose type has not been specified
a ~ P                    Random variable a has distribution P
E_{x~P}[f(x)] or E f(x)  Expectation of f(x) with respect to P(x)
Var(f(x))                Variance of f(x) under P(x)
Cov(f(x), g(x))          Covariance of f(x) and g(x) under P(x)
H(x)                     Shannon entropy of the random variable x
D_KL(P ‖ Q)              Kullback-Leibler divergence of P and Q
N(x; μ, Σ)               Gaussian distribution over x with mean μ and covariance Σ
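As a small illustration of the entropy and KL divergence notations (a sketch for discrete distributions over two outcomes, using natural logarithms; the numbers are made up):

```python
import math

P = [0.5, 0.5]   # a fair coin
Q = [0.9, 0.1]   # a biased coin

# Shannon entropy H(x) = -sum_x P(x) log P(x)
H_P = -sum(p * math.log(p) for p in P if p > 0)

# Kullback-Leibler divergence D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
D_KL = sum(p * math.log(p / q) for p, q in zip(P, Q))

assert math.isclose(H_P, math.log(2))   # entropy of a fair coin is log 2 nats
assert D_KL > 0                          # D_KL is nonnegative, and 0 only if P == Q
```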
Functions

f : A → B        The function f with domain A and range B
f ∘ g            Composition of the functions f and g
f(x; θ)          A function of x parametrized by θ. (Sometimes we just write f(x) and ignore the argument θ to lighten notation.)
log x            Natural logarithm of x
σ(x)             Logistic sigmoid, 1 / (1 + exp(−x))
ζ(x)             Softplus, log(1 + exp(x))
||x||_p          L^p norm of x
||x||            L^2 norm of x
x^+              Positive part of x, i.e., max(0, x)
1_condition      Is 1 if the condition is true, 0 otherwise

Sometimes we use a function f whose argument is a scalar but apply it to a vector, matrix, or tensor: f(x), f(X), or f(X). This means to apply f to the array element-wise. For example, if C = σ(X), then C_{i,j,k} = σ(X_{i,j,k}) for all valid values of i, j and k.
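The element-wise convention can be sketched in NumPy (the array values here are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid sigma(x) = 1 / (1 + exp(-x)), written for scalars."""
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[[0.0, 1.0], [-1.0, 2.0]]])   # a 3-D tensor of shape (1, 2, 2)
C = sigmoid(X)                               # applies sigma element-wise

# C has the same shape as X, and C[i, j, k] == sigmoid(X[i, j, k]):
assert C.shape == X.shape
assert np.isclose(C[0, 0, 0], 0.5)           # sigmoid(0) = 0.5
```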
Datasets and Distributions

p_data           The data generating distribution
p̂_data           The empirical distribution defined by the training set
X                A set of training examples
x^(i)            The i-th example (input) from a dataset
y^(i)            The target associated with x^(i) for supervised learning
X                The m × n matrix with input example x^(i) in row X_{i,:}
Chapter 1
Introduction

Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion, Daedalus, and Hephaestus may all be interpreted as legendary inventors, and Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997).

When programmable computers were first conceived, people wondered whether they might become intelligent, over a hundred years before one was built (Lovelace, 1842). Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics. We look to intelligent software to automate routine labor, understand speech or images, make diagnoses in medicine and support basic scientific research.

In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers—problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.

This book is about a solution to these more intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these
concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning.

Many of the early successes of AI took place in relatively sterile and formal environments and did not require computers to have much knowledge about the world. For example, IBM's Deep Blue chess-playing system defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed ways. Devising a successful chess strategy is a tremendous accomplishment, but the challenge is not due to the difficulty of describing the set of chess pieces and allowable moves to the computer. Chess can be completely described by a very brief list of completely formal rules, easily provided ahead of time by the programmer.

Ironically, abstract and formal tasks that are among the most difficult mental undertakings for a human being are among the easiest for a computer. Computers have long been able to defeat even the best human chess player, but are only recently matching some of the abilities of average human beings to recognize objects or speech. A person's everyday life requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. Computers need to capture this same knowledge in order to behave in an intelligent way. One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer.

Several artificial intelligence projects have sought to hard-code knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. This is known as the knowledge base approach to artificial intelligence. None of these projects has led to a major success. One of the most famous such projects is Cyc (Lenat and Guha, 1989). Cyc is an inference engine and a database of statements in a language called CycL. These statements are entered by a staff of human supervisors. It is an unwieldy process. People struggle to devise formal rules with enough complexity to accurately describe the world. For example, Cyc failed to understand a story about a person named Fred shaving in the morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity "FredWhileShaving" contained electrical parts. It therefore asked whether Fred was still a person while he was shaving.

The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The introduction
of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. A simple machine learning algorithm called logistic regression can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990). A simple machine learning algorithm called naive Bayes can separate legitimate e-mail from spam e-mail.

The performance of these simple machine learning algorithms depends heavily on the representation of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a feature. Logistic regression learns how each of these features of the patient correlates with various outcomes. However, it cannot influence the way that the features are defined in any way. If logistic regression was given an MRI scan of the patient, rather than the doctor's formalized report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation with any complications that might occur during delivery.

This dependence on representations is a general phenomenon that appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming. It is not surprising that the choice of representation has an enormous effect on the performance of machine learning algorithms. For a simple visual example, see Fig. 1.1.

Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker's vocal tract. It therefore gives a strong clue as to whether the speaker is a man, woman, or child.

However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.
Figure 1.1: Example of different representations: suppose we want to separate two categories of data by drawing a line between them in a scatterplot. In the plot on the left, we represent some data using Cartesian coordinates, and the task is impossible. In the plot on the right, we represent the data with polar coordinates and the task becomes simple to solve with a vertical line. (Figure produced in collaboration with David Warde-Farley)
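The change of representation in Fig. 1.1 can be made concrete with a small sketch (an illustrative example, not from the book; the data points are invented): two categories that no single straight line separates in Cartesian coordinates become separable by one threshold on the radius after conversion to polar coordinates.

```python
import math

# Two categories: points near the origin (class 0) and points on a
# larger ring around it (class 1). No single straight line in the
# (x, y) plane separates them.
inner = [(0.5, 0.2), (-0.3, 0.4), (0.1, -0.6)]   # class 0
outer = [(2.0, 0.1), (-1.5, 1.5), (0.3, -2.2)]   # class 1

def to_polar(x, y):
    """Map Cartesian (x, y) to polar (r, theta)."""
    return math.hypot(x, y), math.atan2(y, x)

# After the change of representation, a single vertical line r = 1
# in the (r, theta) plane separates the two categories perfectly.
assert all(to_polar(x, y)[0] < 1.0 for x, y in inner)
assert all(to_polar(x, y)[0] > 1.0 for x, y in outer)
```

The learning problem has not changed; only the representation has, and with it the difficulty of the task.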
One solution to this problem is to use machine learning to discover not only the mapping from representation to output but also the representation itself. This approach is known as representation learning. Learned representations often result in much better performance than can be obtained with hand-designed representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of features for a simple task in minutes, or for a complex task in hours to months. Manually designing features for a complex task requires a great deal of human time and effort; it can take decades for an entire community of researchers.

The quintessential example of a representation learning algorithm is the autoencoder. An autoencoder is the combination of an encoder function that converts the input data into a different representation, and a decoder function that converts the new representation back into the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties.

When designing features or algorithms for learning features, our goal is usually to separate the factors of variation that explain the observed data. In this context, we use the word “factors” simply to refer to separate sources of influence; the factors are usually not combined by multiplication. Such factors are often not quantities
that are directly observed. Instead, they may exist either as unobserved objects or unobserved forces in the physical world that affect observable quantities. They may also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data. When analyzing a speech recording, the factors of variation include the speaker's age, their sex, their accent and the words that they are speaking. When analyzing an image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun.

A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car's silhouette depends on the viewing angle. Most applications require us to disentangle the factors of variation and discard the ones that we do not care about.

Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker's accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.
Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.

The quintessential example of a deep learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input.

The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later instructions can refer back to the results of earlier instructions. According to this
[Figure 1.2: schematic of a deep network: visible layer (input pixels); 1st hidden layer (edges); 2nd hidden layer (corners and contours); 3rd hidden layer (object parts); output (object identity: CAR, PERSON, ANIMAL).]
Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the visible layer, so named because it contains the variables that we are able to observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers are called “hidden” because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer's description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer's description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).
[Figure 1.3: two computational graphs for the same model; left: built from an element set of + and × operations applied to inputs x1, x2 and weights w1, w2; right: built from a single “logistic regression” element applied to inputs x and w.]
Figure 1.3: Illustration of computational graphs mapping an input to an output where each node performs an operation. Depth is the length of the longest path from input to output but depends on the definition of what constitutes a possible computational step. The computation depicted in these graphs is the output of a logistic regression model, σ(wᵀx), where σ is the logistic sigmoid function. If we use addition, multiplication and logistic sigmoids as the elements of our computer language, then this model has depth three. If we view logistic regression as an element itself, then this model has depth one.
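The two depth measurements in Fig. 1.3 can be made concrete (an illustrative sketch with made-up weights and inputs): computed from the element set {+, ×, σ}, the longest input-to-output path for σ(wᵀx) visits three operations, while treating logistic regression itself as a primitive gives a path of length one.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

w = [2.0, -1.0]
x = [0.5, 0.25]

# Element set {+, x, sigmoid}: multiplications, then an addition, then
# the sigmoid -- the longest path visits three operations (depth 3).
products = [wi * xi for wi, xi in zip(w, x)]   # the "x" nodes
z = sum(products)                              # the "+" node
y_deep = sigmoid(z)                            # the sigmoid node

# Element set {logistic regression}: the same computation viewed as one
# primitive operation (depth 1).
def logistic_regression(w, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

y_shallow = logistic_regression(w, x)
assert abs(y_deep - y_shallow) < 1e-12  # same function, different depth
```

The point of the sketch is that depth is a property of how we describe the computation, not of the function being computed.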
view of deep learning, not all of the information in a layer's activations necessarily encodes factors of variation that explain the input. The representation also stores state information that helps to execute a program that can make sense of the input. This state information could be analogous to a counter or pointer in a traditional computer program. It has nothing to do with the content of the input specifically, but it helps the model to organize its processing.

There are two main ways of measuring the depth of a model. The first view is based on the number of sequential instructions that must be executed to evaluate the architecture. We can think of this as the length of the longest path through a flow chart that describes how to compute each of the model's outputs given its inputs. Just as two equivalent computer programs will have different lengths depending on which language the program is written in, the same function may be drawn as a flowchart with different depths depending on which functions we allow to be used as individual steps in the flowchart. Fig. 1.3 illustrates how this choice of language can give two different measurements for the same architecture.

Another approach, used by deep probabilistic models, regards the depth of a model as being not the depth of the computational graph but the depth of the graph describing how concepts are related to each other. In this case, the depth of the flowchart of the computations needed to compute the representation of
each concept may be much deeper than the graph of the concepts themselves. This is because the system's understanding of the simpler concepts can be refined given information about the more complex concepts. For example, an AI system observing an image of a face with one eye in shadow may initially only see one eye. After detecting that a face is present, it can then infer that a second eye is probably present as well. In this case, the graph of concepts only includes two layers—a layer for eyes and a layer for faces—but the graph of computations includes 2n layers if we refine our estimate of each concept given the other n times.
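The eye/face refinement loop just described can be sketched as alternating belief updates (a toy illustration; the update rules and numbers are invented): the concept graph has only two layers, but n rounds of mutual refinement execute 2n sequential computation steps.

```python
# Two concepts: "a face is present" and "two eyes are present".
# Each refinement step updates one belief from the other, so n rounds
# of mutual refinement execute 2n sequential steps even though the
# concept graph has only two layers. The update rules and initial
# values here are invented purely for illustration.
def refine_face(face, eyes):
    return face + 0.5 * eyes * (1.0 - face)   # seen eyes support the face belief

def refine_eyes(eyes, face):
    return eyes + 0.5 * face * (1.0 - eyes)   # a face makes a second eye likely

face, eyes, steps = 0.6, 0.3, 0
n = 3
for _ in range(n):
    face = refine_face(face, eyes); steps += 1
    eyes = refine_eyes(eyes, face); steps += 1

assert steps == 2 * n              # depth of the computation graph
assert face > 0.6 and eyes > 0.3   # both beliefs were strengthened
```

Under the concept-graph view this model has depth two; under the computational-graph view its depth grows with the number of refinement rounds.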
Because it is not always clear which of these two views—the depth of the computational graph, or the depth of the probabilistic modeling graph—is most relevant, and because different people choose different sets of smallest elements from which to construct their graphs, there is no single correct value for the depth of an architecture, just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as “deep.” However, deep learning can safely be regarded as the study of models that involve a greater amount of composition of either learned functions or learned concepts than traditional machine learning does.

To summarize, deep learning, the subject of this book, is an approach to AI. Specifically, it is a type of machine learning, a technique that allows computer systems to improve with experience and data. According to the authors of this book, machine learning is the only viable approach to building AI systems that can operate in complicated, real-world environments. Deep learning is a particular kind of machine learning that achieves great power and flexibility by learning to represent the world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more abstract representations computed in terms of less abstract ones. Fig. 1.4 illustrates the relationship between these different AI disciplines. Fig. 1.5 gives a high-level schematic of how each works.
1.1
Who Should Read This Book?
This book can be useful for a variety of readers, but we wrote it with two main target audiences in mind. One of these target audiences is university students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and artificial intelligence research. The other target audience is software engineers who do not have a machine learning or statistics background, but want to rapidly acquire one and begin using deep learning in their product or platform. Deep learning has already proven useful in many software disciplines including computer vision, speech and audio processing,
[Figure 1.4: nested Venn diagram: AI (example: knowledge bases) contains machine learning (example: logistic regression), which contains representation learning (example: shallow autoencoders), which contains deep learning (example: MLPs).]
Figure 1.4: A Venn diagram showing how deep learning is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology.
[Figure 1.5: four flowcharts; rule-based systems: input, hand-designed program, output; classic machine learning: input, hand-designed features, mapping from features, output; representation learning: input, features, mapping from features, output; deep learning: input, simple features, additional layers of more abstract features, mapping from features, output.]
Figure 1.5: Flowcharts showing how the different parts of an AI system relate to each other within different AI disciplines. Shaded boxes indicate components that are able to learn from data.
natural language processing, robotics, bioinformatics and chemistry, video games, search engines, online advertising and finance.

This book has been organized into three parts in order to best accommodate a variety of readers. Part I introduces basic mathematical tools and machine learning concepts. Part II describes the most established deep learning algorithms that are essentially solved technologies. Part III describes more speculative ideas that are widely believed to be important for future research in deep learning.

Readers should feel free to skip parts that are not relevant given their interests or background. Readers familiar with linear algebra, probability, and fundamental machine learning concepts can skip Part I, for example, while readers who just want to implement a working system need not read beyond Part II. To help choose which chapters to read, Fig. 1.6 provides a flowchart showing the high-level organization of the book.

We do assume that all readers come from a computer science background. We assume familiarity with programming, a basic understanding of computational performance issues, complexity theory, introductory level calculus and some of the terminology of graph theory.
1.2
Historical Trends in Deep Learning
It is easiest to understand deep learning with some historical context. Rather than providing a detailed history of deep learning, we identify a few key trends:

• Deep learning has had a long and rich history, but has gone by many names reflecting different philosophical viewpoints, and has waxed and waned in popularity.

• Deep learning has become more useful as the amount of available training data has increased.

• Deep learning models have grown in size over time as computer hardware and software infrastructure for deep learning has improved.

• Deep learning has solved increasingly complicated applications with increasing accuracy over time.
[Figure 1.6: chapter dependency chart: 1. Introduction; Part I: Applied Math and Machine Learning Basics (2. Linear Algebra, 3. Probability and Information Theory, 4. Numerical Computation, 5. Machine Learning Basics); Part II: Deep Networks: Modern Practices (6. Deep Feedforward Networks, 7. Regularization, 8. Optimization, 9. CNNs, 10. RNNs, 11. Practical Methodology, 12. Applications); Part III: Deep Learning Research (13. Linear Factor Models, 14. Autoencoders, 15. Representation Learning, 16. Structured Probabilistic Models, 17. Monte Carlo Methods, 18. Partition Function, 19. Inference, 20. Deep Generative Models).]
Figure 1.6: The high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter.
1.2.1
The Many Names and Changing Fortunes of Neural Networks

We expect that many readers of this book have heard of deep learning as an exciting new technology, and are surprised to see a mention of "history" in a book about an emerging field. In fact, deep learning dates back to the 1940s. Deep learning only appears to be new because it was relatively unpopular for several years preceding its current popularity, and because it has gone through many different names, only recently becoming called "deep learning." The field has been rebranded many times, reflecting the influence of different researchers and different perspectives.

A comprehensive history of deep learning is beyond the scope of this textbook. However, some basic context is useful for understanding deep learning. Broadly speaking, there have been three waves of development of deep learning: deep learning known as cybernetics in the 1940s–1960s, deep learning known as connectionism in the 1980s–1990s, and the current resurgence under the name deep learning beginning in 2006. This is quantitatively illustrated in Fig. 1.7.

Some of the earliest learning algorithms we recognize today were intended to be computational models of biological learning, i.e. models of how learning happens or could happen in the brain. As a result, one of the names that deep learning has gone by is artificial neural networks (ANNs). The corresponding perspective on deep learning models is that they are engineered systems inspired by the biological brain (whether the human brain or the brain of another animal). While the kinds of neural networks used for machine learning have sometimes been used to understand brain function (Hinton and Shallice, 1991), they are generally not designed to be realistic models of biological function.

The neural perspective on deep learning is motivated by two main ideas. One idea is that the brain provides a proof by example that intelligent behavior is possible, and a conceptually straightforward path to building intelligence is to reverse engineer the computational principles behind the brain and duplicate its functionality. Another perspective is that it would be deeply interesting to understand the brain and the principles that underlie human intelligence, so machine learning models that shed light on these basic scientific questions are useful apart from their ability to solve engineering applications.

The modern term "deep learning" goes beyond the neuroscientific perspective on the current breed of machine learning models. It appeals to a more general principle of learning multiple levels of composition, which can be applied in machine learning frameworks that are not necessarily neurally inspired.
[Figure 1.7 appears here: frequency of word or phrase (y-axis) versus year, 1940–2000 (x-axis), for the phrases "cybernetics" and "connectionism + neural networks".]
Figure 1.7: The figure shows two of the three historical waves of artificial neural nets research, as measured by the frequency of the phrases "cybernetics" and "connectionism" or "neural networks" according to Google Books (the third wave is too recent to appear). The first wave started with cybernetics in the 1940s–1960s, with the development of theories of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of the first models such as the perceptron (Rosenblatt, 1958) allowing the training of a single neuron. The second wave started with the connectionist approach of the 1980–1995 period, with back-propagation (Rumelhart et al., 1986a) to train a neural network with one or two hidden layers. The current and third wave, deep learning, started around 2006 (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007a), and is just now appearing in book form as of 2016. The other two waves similarly appeared in book form much later than the corresponding scientific activity occurred.
The earliest predecessors of modern deep learning were simple linear models motivated from a neuroscientific perspective. These models were designed to take a set of n input values x1, . . . , xn and associate them with an output y. These models would learn a set of weights w1, . . . , wn and compute their output f(x, w) = x1w1 + · · · + xnwn. This first wave of neural networks research was known as cybernetics, as illustrated in Fig. 1.7.

The McCulloch-Pitts Neuron (McCulloch and Pitts, 1943) was an early model of brain function. This linear model could recognize two different categories of inputs by testing whether f(x, w) is positive or negative. Of course, for the model to correspond to the desired definition of the categories, the weights needed to be set correctly. These weights could be set by the human operator. In the 1950s, the perceptron (Rosenblatt, 1958, 1962) became the first model that could learn the weights defining the categories given examples of inputs from each category. The adaptive linear element (ADALINE), which dates from about the same time, simply returned the value of f(x) itself to predict a real number (Widrow and Hoff, 1960), and could also learn to predict these numbers from data.

These simple learning algorithms greatly affected the modern landscape of machine learning. The training algorithm used to adapt the weights of the ADALINE was a special case of an algorithm called stochastic gradient descent. Slightly modified versions of the stochastic gradient descent algorithm remain the dominant training algorithms for deep learning models today.

Models based on the f(x, w) used by the perceptron and ADALINE are called linear models.
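As a concrete illustration, the sketch below trains such a linear unit with the ADALINE rule, a special case of stochastic gradient descent on squared error. This is an illustrative reconstruction, not code from the book; the learning rate, epoch count, and toy dataset are arbitrary choices.

```python
# ADALINE-style linear unit: f(x, w) = x1*w1 + ... + xn*wn,
# trained by stochastic gradient descent on squared error.
# Illustrative sketch only; hyperparameters and data are arbitrary.

def f(x, w):
    """Linear model output: dot product of inputs and weights."""
    return sum(xi * wi for xi, wi in zip(x, w))

def train_adaline(data, n, lr=0.1, epochs=50):
    """data: list of (x, y) pairs; n: number of input values."""
    w = [0.0] * n
    for _ in range(epochs):
        for x, y in data:
            error = f(x, w) - y
            # SGD step on squared error: w_i <- w_i - lr * error * x_i
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    return w

# Learn the target y = 2*x1 - x2 from three examples.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = train_adaline(data, n=2)
print(w)  # close to [2.0, -1.0]
```

The per-example update is the point of contact with modern practice: the same gradient step, applied to far larger models, is still how deep networks are trained today.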
These models remain some of the most widely used machine learning models, though in many cases they are trained in different ways than the original models were trained.

Linear models have many limitations. Most famously, they cannot learn the XOR function, where f([0, 1], w) = 1 and f([1, 0], w) = 1 but f([1, 1], w) = 0 and f([0, 0], w) = 0. Critics who observed these flaws in linear models caused a backlash against biologically inspired learning in general (Minsky and Papert, 1969). This was the first major dip in the popularity of neural networks.

Today, neuroscience is regarded as an important source of inspiration for deep learning researchers, but it is no longer the predominant guide for the field.
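Returning to the XOR limitation above: it can be verified mechanically. The sketch below (an illustration, not from the book) searches a grid of weights and a bias for a linear threshold unit, using the positive/negative classification rule described for the McCulloch-Pitts neuron: it finds settings that implement OR, but none that implement XOR, since no line can separate {(0,1), (1,0)} from {(0,0), (1,1)}.

```python
# Brute-force check that a linear threshold unit
# f(x, w) = w1*x1 + w2*x2 + b, classifying by whether f is positive,
# can implement OR but not XOR. Illustrative sketch; the weight grid
# is an arbitrary choice.

import itertools

def fits(target):
    """Return a (w1, w2, b) triple realizing `target`, or None."""
    grid = [k / 2 for k in range(-6, 7)]        # -3.0, -2.5, ..., 3.0
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all((w1 * x1 + w2 * x2 + b > 0) == target[(x1, x2)]
               for x1, x2 in inputs):
            return (w1, w2, b)
    return None

OR  = {(0, 0): False, (0, 1): True, (1, 0): True, (1, 1): True}
XOR = {(0, 0): False, (0, 1): True, (1, 0): True, (1, 1): False}

print(fits(OR))   # some solution exists, e.g. w1 = w2 = 1 with a small negative bias
print(fits(XOR))  # None: no linear threshold unit computes XOR
```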
The main reason for the diminished role of neuroscience in deep learning research today is that we simply do not have enough information about the brain to use it as a guide. To obtain a deep understanding of the actual algorithms used by the brain, we would need to be able to monitor the activity of (at the very least) thousands of interconnected neurons simultaneously. Because we are not able to do this, we are far from understanding even some of the most simple and
well-studied parts of the brain (Olshausen and Field, 2005).

Neuroscience has given us a reason to hope that a single deep learning algorithm can solve many different tasks. Neuroscientists have found that ferrets can learn to "see" with the auditory processing region of their brain if their brains are rewired to send visual signals to that area (Von Melchner et al., 2000). This suggests that much of the mammalian brain might use a single algorithm to solve most of the different tasks that the brain solves. Before this hypothesis, machine learning research was more fragmented, with different communities of researchers studying natural language processing, vision, motion planning and speech recognition. Today, these application communities are still separate, but it is common for deep learning research groups to study many or even all of these application areas simultaneously.

We are able to draw some rough guidelines from neuroscience. The basic idea of having many computational units that become intelligent only via their interactions with each other is inspired by the brain. The Neocognitron (Fukushima, 1980) introduced a powerful model architecture for processing images that was inspired by the structure of the mammalian visual system and later became the basis for the modern convolutional network (LeCun et al., 1998b), as we will see in Sec. 9.10. Most neural networks today are based on a model neuron called the rectified linear unit. The original Cognitron (Fukushima, 1975) introduced a more complicated version that was highly inspired by our knowledge of brain function. The simplified modern version was developed incorporating ideas from many viewpoints, with Nair and Hinton (2010) and Glorot et al. (2011a) citing neuroscience as an influence, and Jarrett et al. (2009) citing more engineering-oriented influences. While neuroscience is an important source of inspiration, it need not be taken as a rigid guide. We know that actual neurons compute very different functions than modern rectified linear units, but greater neural realism has not yet led to an improvement in machine learning performance. Also, while neuroscience has successfully inspired several neural network architectures, we do not yet know enough about biological learning for neuroscience to offer much guidance for the learning algorithms we use to train these architectures.

Media accounts often emphasize the similarity of deep learning to the brain. While it is true that deep learning researchers are more likely to cite the brain as an influence than researchers working in other machine learning fields such as kernel machines or Bayesian statistics, one should not view deep learning as an attempt to simulate the brain. Modern deep learning draws inspiration from many fields, especially applied math fundamentals like linear algebra, probability, information theory, and numerical optimization. While some deep learning researchers cite neuroscience as an important source of inspiration, others are not concerned with
neuroscience at all.

It is worth noting that the effort to understand how the brain works on an algorithmic level is alive and well. This endeavor is primarily known as "computational neuroscience" and is a separate field of study from deep learning. It is common for researchers to move back and forth between both fields. The field of deep learning is primarily concerned with how to build computer systems that are able to successfully solve tasks requiring intelligence, while the field of computational neuroscience is primarily concerned with building more accurate models of how the brain actually works.

In the 1980s, the second wave of neural network research emerged in great part via a movement called connectionism or parallel distributed processing (Rumelhart et al., 1986c; McClelland et al., 1995). Connectionism arose in the context of cognitive science. Cognitive science is an interdisciplinary approach to understanding the mind, combining multiple different levels of analysis. During the early 1980s, most cognitive scientists studied models of symbolic reasoning. Despite their popularity, symbolic models were difficult to explain in terms of how the brain could actually implement them using neurons. The connectionists began to study models of cognition that could actually be grounded in neural implementations (Touretzky and Minton, 1985), reviving many ideas dating back to the work of psychologist Donald Hebb in the 1940s (Hebb, 1949).

The central idea in connectionism is that a large number of simple computational units can achieve intelligent behavior when networked together. This insight applies equally to neurons in biological nervous systems and to hidden units in computational models.

Several key concepts arose during the connectionism movement of the 1980s that remain central to today's deep learning.

One of these concepts is that of distributed representation (Hinton et al., 1986). This is the idea that each input to a system should be represented by many features, and each feature should be involved in the representation of many possible inputs. For example, suppose we have a vision system that can recognize cars, trucks, and birds, and these objects can each be red, green, or blue. One way of representing these inputs would be to have a separate neuron or hidden unit that activates for each of the nine possible combinations: red truck, red car, red bird, green truck, and so on. This requires nine different neurons, and each neuron must independently learn the concept of color and object identity. One way to improve on this situation is to use a distributed representation, with three neurons describing the color and three neurons describing the object identity. This requires only six neurons total instead of nine, and the neuron describing redness is able to learn about redness
from images of cars, trucks and birds, not only from images of one specific category of objects. The concept of distributed representation is central to this book, and will be described in greater detail in Chapter 15.

Another major accomplishment of the connectionist movement was the successful use of back-propagation to train deep neural networks with internal representations and the popularization of the back-propagation algorithm (Rumelhart et al., 1986a; LeCun, 1987). This algorithm has waxed and waned in popularity but as of this writing is currently the dominant approach to training deep models.

During the 1990s, researchers made important advances in modeling sequences with neural networks. Hochreiter (1991) and Bengio et al. (1994) identified some of the fundamental mathematical difficulties in modeling long sequences, described
Hochreiter and Sc Schmidh hmidh hmidhub ub uber er (1997) introduced the long short-term of the fundamental mathematical difficulties modeling long sequences, describ ed memory or LSTM net netw work to resolv resolvee some ofinthese difficulties. Toda day y, the LSTM in widely Sec. 10.7 . Hochreiter and Schmidh uber tasks, (1997)including introduced thenatural long short-term is used for many sequence mo modeling deling many language memory or LSTM net w ork to resolv e some of these difficulties. T o da y , the LSTM pro processing cessing tasks at Go Google. ogle. is widely used for many sequence modeling tasks, including many natural language The second wa wave ve of neural netw networks orks research lasted un until til the mid-1990s. Venprocessing tasks at Google. tures based on neural netw networks orks and other AI technologies began to make unrealistisecond wa ve ofwhile neural networks research lasted the mid-1990s. VencallyThe ambitious claims seeking inv investments. estments. When un AItil research did not fulfill tures based on neuralexp netw orks and inv other AI technologies bointed. egan to Simultaneously make unrealisti-, these unreasonable expectations, ectations, investors estors were disapp disappointed. Simultaneously, cally ambitious claims while seeking inv estments. When AI research not et fulfill other fields of mac machine hine learning made adv advances. ances. Kernel mac machines hines (did Boser al., these unreasonable exp ectations, inv estors were disapp ointed. Simultaneously 1992; Cortes and Vapnik, 1995; Schölk Schölkopf opf et al. al.,, 1999) and graphical mo models dels (Jor-, other fields hineedlearning madeonadv ances. Kernel machines (Boser et al., dan , 1998 ) bof othmac achiev achieved go goood results many imp importan ortan ortantt tasks. These two factors 1992 ; Cortes andinVthe apnik , 1995; Schölk opf etnetw al., orks 1999)that andlasted graphical dels (Jorled to a decline popularity of neural networks untilmo 2007. 
dan, 1998) both achieved good results on many important tasks. These two factors During this time, neural net netw works con contin tin tinued ued to obtain impressiv impressivee performance led to a decline in the popularity of neural networks that lasted until 2007. on some tasks (LeCun et al., 1998b; Bengio et al., 2001). The Canadian Institute thisResearc time, neural net works con to neural obtain net impressiv performance for During Adv Advanced anced Research h (CIF (CIFAR) AR) help helped ed tin toued keep netw works eresearch alive on some tasks ( LeCun et al. , 1998b ; Bengio et al. , 2001 ). The Canadian Institute via its Neural Computation and Adaptiv Adaptivee Perception (NCAP) research initiative. for Adv anced Researc h (CIF AR) helpedresearc to keep neural led netw orks research alive This program united machine learning research h groups by Geoffrey Hinton via its Neural Computation and Adaptiv e P erception (NCAP) research initiative. at Universit University y of Toron oronto, to, Yosh oshua ua Bengio at Univ Universit ersit ersity y of Montreal, and Yann This program learning researc groupsresearch led by Geoffrey LeCun at Newunited York machine Universit University y. The CIF CIFAR AR hNCAP initiativeHinton had a at Universit y of T oron to, Y osh ua Bengio at Univ ersit y of Montreal, and Yann multi-disciplinary nature that also included neuroscien neuroscientists tists and experts in human LeCun at New Y ork Universit y . The CIF AR NCAP research initiative had a and computer vision. multi-disciplinary nature that also included neuroscientists and experts in human At this poin ointt in time, deep netw networks orks were generally believ elieved ed to be very difficult and computer vision. to train. W Wee now know that algorithms that hav havee existed since the 1980s work Atwell, this but pointhis t in w time, deep networks were generally believ to be vsimply ery difficult quite as not apparent circa 2006. The issue isedperhaps that to train. 
W e now know that algorithms that hav e existed since the 1980s work these algorithms were to too o computationally costly to allo allow w muc much h exp experimentation erimentation quite well, but this w as not apparent circa 2006. The issue is p erhaps simply that with the hardware av available ailable at the time. these algorithms were too computationally costly to allow much experimentation The third wa wav ve of neural netw networks orks research began with a breakthrough in with the hardware available at the time. The third wave of neural networks18research began with a breakthrough in
CHAPTER 1. INTRODUCTION
2006. Geoffrey Hinton showed that a kind of neural network called a deep belief network could be efficiently trained using a strategy called greedy layer-wise pretraining (Hinton et al., 2006), which will be described in more detail in Sec. 15.1. The other CIFAR-affiliated research groups quickly showed that the same strategy could be used to train many other kinds of deep networks (Bengio et al., 2007; Ranzato et al., 2007a) and systematically helped to improve generalization on test examples. This wave of neural networks research popularized the use of the term deep learning to emphasize that researchers were now able to train deeper neural networks than had been possible before, and to focus attention on the theoretical importance of depth (Bengio and LeCun, 2007; Delalleau and Bengio, 2011; Pascanu et al., 2014a; Montufar et al., 2014).

At this time, deep neural networks outperformed competing AI systems based on other machine learning technologies as well as hand-designed functionality. This third wave of popularity of neural networks continues to the time of this writing, though the focus of deep learning research has changed dramatically within the time of this wave. The third wave began with a focus on new unsupervised learning techniques and the ability of deep models to generalize well from small datasets, but today there is more interest in much older supervised learning algorithms and the ability of deep models to leverage large labeled datasets.
1.2.2
Increasing Dataset Sizes
One may wonder why deep learning has only recently become recognized as a crucial technology though the first experiments with artificial neural networks were conducted in the 1950s. Deep learning has been successfully used in commercial applications since the 1990s, but was often regarded as being more of an art than a technology and something that only an expert could use, until recently. It is true that some skill is required to get good performance from a deep learning algorithm. Fortunately, the amount of skill required reduces as the amount of training data increases. The learning algorithms reaching human performance on complex tasks today are nearly identical to the learning algorithms that struggled to solve toy problems in the 1980s, though the models we train with these algorithms have undergone changes that simplify the training of very deep architectures.

The most important new development is that today we can provide these algorithms with the resources they need to succeed. Fig. 1.8 shows how the size of benchmark datasets has increased remarkably over time. This trend is driven by the increasing digitization of society. As more and more of our activities take place on computers, more and more of what we do is recorded. As our computers are increasingly networked together, it becomes easier to centralize these records and curate them
into a dataset appropriate for machine learning applications. The age of "Big Data" has made machine learning much easier because the key burden of statistical estimation (generalizing well to new data after observing only a small amount of data) has been considerably lightened. As of 2016, a rough rule of thumb is that a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained with a dataset containing at least 10 million labeled examples. Working successfully with datasets smaller than this is an important research area, focusing in particular on how we can take advantage of large quantities of unlabeled examples, with unsupervised or semi-supervised learning.
1.2.3
Increasing Model Sizes

Another key reason that neural networks are wildly successful today after enjoying comparatively little success since the 1980s is that we have the computational resources to run much larger models today. One of the main insights of connectionism is that animals become intelligent when many of their neurons work together. An individual neuron or small collection of neurons is not particularly useful.

Biological neurons are not especially densely connected. As seen in Fig. 1.10, our machine learning models have had a number of connections per neuron that was within an order of magnitude of even mammalian brains for decades.

In terms of the total number of neurons, neural networks have been astonishingly small until quite recently, as shown in Fig. 1.11. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. This growth is driven by faster computers with larger memory and by the availability of larger datasets. Larger networks are able to achieve higher accuracy on more complex tasks. This trend looks set to continue for decades. Unless new technologies allow faster scaling, artificial neural networks will not have the same number of neurons as the human brain until at least the 2050s. Biological neurons may represent more complicated functions than current artificial neurons, so biological neural networks may be even larger than this plot portrays.

In retrospect, it is not particularly surprising that neural networks with fewer neurons than a leech were unable to solve sophisticated artificial intelligence problems. Even today's networks, which we consider quite large from a computational systems point of view, are smaller than the nervous system of even relatively primitive vertebrate animals like frogs.

The increase in model size over time, due to the availability of faster CPUs,
[Figure 1.8 plot: dataset size (number of examples, log scale from 10^0 to 10^9) versus year (1900-2015), marking datasets including Iris, T vs G vs F, Rotated T vs C, Criminals, MNIST, Public SVHN, ImageNet, CIFAR-10, ILSVRC 2014, ImageNet10k, Sports-1M, WMT, and the Canadian Hansard.]
Figure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statisticians studied datasets using hundreds or thousands of manually compiled measurements (Garson, 1900; Gosset, 1908; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the pioneers of biologically inspired machine learning often worked with small, synthetic datasets, such as low-resolution bitmaps of letters, that were designed to incur low computational cost and demonstrate that neural networks were able to learn specific kinds of functions (Widrow and Hoff, 1960; Rumelhart et al., 1986b). In the 1980s and 1990s, machine learning became more statistical in nature and began to leverage larger datasets containing tens of thousands of examples such as the MNIST dataset (shown in Fig. 1.9) of scans of handwritten numbers (LeCun et al., 1998b). In the first decade of the 2000s, more sophisticated datasets of this same size, such as the CIFAR-10 dataset (Krizhevsky and Hinton, 2009) continued to be produced. Toward the end of that decade and throughout the first half of the 2010s, significantly larger datasets, containing hundreds of thousands to tens of millions of examples, completely changed what was possible with deep learning. These datasets included the public Street View House Numbers dataset (Netzer et al., 2011), various versions of the ImageNet dataset (Deng et al., 2009, 2010a; Russakovsky et al., 2014a), and the Sports-1M dataset (Karpathy et al., 2014). At the top of the graph, we see that datasets of translated sentences, such as IBM's dataset constructed from the Canadian Hansard (Brown et al., 1990) and the WMT 2014 English to French dataset (Schwenk, 2014) are typically far ahead of other dataset sizes.
Figure 1.9: Example inputs from the MNIST dataset. The "NIST" stands for National Institute of Standards and Technology, the agency that originally collected this data. The "M" stands for "modified," since the data has been preprocessed for easier use with machine learning algorithms. The MNIST dataset consists of scans of handwritten digits and associated labels describing which of the digits 0-9 is contained in each image. This simple classification problem is one of the simplest and most widely used tests in deep learning research. It remains popular despite being quite easy for modern techniques to solve. Geoffrey Hinton has described it as "the drosophila of machine learning," meaning that it allows machine learning researchers to study their algorithms in controlled laboratory conditions, much as biologists often study fruit flies.
the advent of general purpose GPUs (described in Sec. 12.1.2), faster network connectivity and better software infrastructure for distributed computing, is one of the most important trends in the history of deep learning. This trend is generally expected to continue well into the future.
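The growth rate quoted earlier in this section (network size doubling roughly every 2.4 years) is enough to reproduce the 2050s projection with back-of-the-envelope arithmetic. A minimal sketch; the starting size of 10^7 neurons for a large mid-2010s network and the figure of 10^11 neurons for the human brain are round numbers assumed for illustration:

```python
import math

def years_to_reach(start, target, doubling_time=2.4):
    """Years for a quantity doubling every `doubling_time` years
    to grow from `start` to `target`."""
    return math.log2(target / start) * doubling_time

# Assumed round figures: ~1e7 neurons for a large mid-2010s network,
# ~1e11 neurons in the human brain.
years = years_to_reach(1e7, 1e11)
print(2015 + years)  # roughly 2047 with these inputs
```

With a smaller assumed starting size (say 10^6 neurons), the same arithmetic gives the mid-2050s, consistent with the "at least the 2050s" estimate in the text.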
1.2.4
Increasing Accuracy, Complexity and Real-World Impact
Since the 1980s, deep learning has consistently improved in its ability to provide accurate recognition or prediction. Moreover, deep learning has consistently been applied with success to broader and broader sets of applications.

The earliest deep models were used to recognize individual objects in tightly cropped, extremely small images (Rumelhart et al., 1986a). Since then there has been a gradual increase in the size of images neural networks could process. Modern object recognition networks process rich high-resolution photographs and do not have a requirement that the photo be cropped near the object to be recognized (Krizhevsky et al., 2012). Similarly, the earliest networks could only recognize two kinds of objects (or in some cases, the absence or presence of a single kind of object), while these modern networks typically recognize at least 1,000 different categories of objects.

The largest contest in object recognition is the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) held each year. A dramatic moment in the meteoric rise of deep learning came when a convolutional network won this challenge for the first time and by a wide margin, bringing down the state-of-the-art top-5 error rate from 26.1% to 15.3% (Krizhevsky et al., 2012), meaning that the convolutional network produces a ranked list of possible categories for each image and the correct category appeared in the first five entries of this list for all but 15.3% of the test examples. Since then, these competitions are consistently won by deep convolutional nets, and as of this writing, advances in deep learning have brought the latest top-5 error rate in this contest down to 3.6%, as shown in Fig. 1.12.
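The top-5 error rate used in this contest can be computed directly from a model's ranked class scores. A minimal sketch (the scores and labels below are invented for illustration):

```python
def top5_error(scores, labels):
    """Fraction of examples whose true label is not among the
    five highest-scoring predicted categories."""
    errors = 0
    for class_scores, true_label in zip(scores, labels):
        # Indices of the five highest-scoring classes for this example.
        top5 = sorted(range(len(class_scores)),
                      key=lambda c: class_scores[c], reverse=True)[:5]
        if true_label not in top5:
            errors += 1
    return errors / len(labels)

# Invented scores over 10 classes for two images:
scores = [
    [0.05, 0.30, 0.20, 0.10, 0.08, 0.07, 0.06, 0.05, 0.05, 0.04],
    [0.50, 0.20, 0.10, 0.08, 0.05, 0.03, 0.02, 0.01, 0.005, 0.005],
]
labels = [2, 9]  # true label 2 is in the top five; label 9 is not
print(top5_error(scores, labels))  # 0.5
```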
Deep learning has also had a dramatic impact on speech recognition. After improving throughout the 1990s, the error rates for speech recognition stagnated starting in about 2000. The introduction of deep learning (Dahl et al., 2010; Deng et al., 2010b; Seide et al., 2011; Hinton et al., 2012a) to speech recognition resulted in a sudden drop of error rates, with some error rates cut in half. We will explore this history in more detail in Sec. 12.3.

Deep networks have also had spectacular successes for pedestrian detection and image segmentation (Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013) and yielded superhuman performance in traffic sign classification (Ciresan
[Figure 1.10 plot: number of connections per neuron (log scale, 10^1 to 10^4) versus year (1950-2015), with reference levels for fruit fly, mouse, cat, and human, and numbered markers 1-10 identifying the models listed in the caption.]
Figure 1.10: Initially, the number of connections between neurons in artificial neural networks was limited by hardware capabilities. Today, the number of connections between neurons is mostly a design consideration. Some artificial neural networks have nearly as many connections per neuron as a cat, and it is quite common for other neural networks to have as many connections per neuron as smaller mammals like mice. Even the human brain does not have an exorbitant amount of connections per neuron. Biological neural network sizes from Wikipedia (2015).
1. Adaptive linear element (Widrow and Hoff, 1960)
2. Neocognitron (Fukushima, 1980)
3. GPU-accelerated convolutional network (Chellapilla et al., 2006)
4. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
5. Unsupervised convolutional network (Jarrett et al., 2009)
6. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
7. Distributed autoencoder (Le et al., 2012)
8. Multi-GPU convolutional network (Krizhevsky et al., 2012)
9. COTS HPC unsupervised convolutional network (Coates et al., 2013)
10. GoogLeNet (Szegedy et al., 2014a)
et al. al.,, 2012). At the same time that the scale and accuracy of deep netw networks orks has increased, et al., 2012). so has the complexity of the tasks that they can solve. Go Goo odfellow et al. (2014d) A t the same time that the scale and accuracy of deep netw orks has increased, sho show wed that neural netw networks orks could learn to output an entire sequence of characters so has the of therather tasksthan thatjust theyidentifying can solve.aGo odfellow et al. (2014d), transcrib transcribed edcomplexity from an image, single ob object. ject. Previously Previously, showwasedwidely that neural netw orksthis could learn to output an entire sequence of characters it believed that kind of learning required lab labeling eling of the individual transcrib ed from an image, rather than just identifying a single ob ject. Previously elemen elements ts of the sequence (Gülçehre and Bengio, 2013). Recurren Recurrentt neural net netw works,, it whasaswidely believedsequence that thismo kind learning required the individual suc such the LSTM model delofmentioned ab abov ov ove, e, lab areeling nowofused to mo model del elemen ts of the sequence ( Gülçehre and Bengio , 2013 ). Recurren t neural net w orks, relationships bet etw ween se sequenc quenc quences es and other se sequenc quenc quences es rather than just fixed inputs. such sequence-to-sequence as the LSTM sequence modelseems mentioned e, are used to model This learning to be ab onovthe cuspnow of rev revolutionizing olutionizing relationships betweenmachine sequences and other(Sutskev sequences rather than; just fixed inputs. another application: translation Sutskever er et al. al.,, 2014 Bahdanau et al. al.,, This sequence-to-sequence learning seems to b e on the cusp of rev olutionizing 2015 2015). ). another application: machine translation (Sutskever et al., 2014; Bahdanau et al., This trend of increasing complexit complexity y has been pushed to its logical conclusion 2015). 
with the introduction of neural Turing machines (Grav Graves es et al. al.,, 2014a) that learn This trend of increasing complexit y has b een pushed to logicalcells. conclusion to read from memory cells and write arbitrary con conten ten tentt to its memory Suc Such h with the introduction of neural T uring machines ( Grav es et al. , 2014a ) that learn neural net netw works can learn simple programs from examples of desired behavior. For to read from andlists write conten t to memory cells. Suc h example, they memory can learncells to sort of arbitrary num umbers bers given examples of scrambled and neural net works canThis learn simple programs technology from examples desired behavior. For sorted sequences. self-programming is inofits infancy infancy, , but in the example, theyincan learn to lists to of nearly numbers future could principle besort applied an any ygiven task.examples of scrambled and sorted sequences. This self-programming technology is in its infancy, but in the Another crowning achiev achievement ement of deep learning is its extension to the domain future could in principle be applied to nearly any task. of reinfor einforccement le learning arning arning.. In the context of reinforcement learning, an autonomous Another crowning achievement is its extension to the domain agen agent t must learn to perform a task of bydeep triallearning and error, without an any y guidance from of r einfor c ement le arning . In the context of reinforcement learning, an autonomous the human op operator. erator. DeepMind demonstrated that a reinforcement learning system agen t must learn to perform a taskofby trial and without anygames, guidance from based on deep learning is capable learning to error, play Atari video reaching the human op erator. DeepMind demonstrated that a reinforcement learning system human-lev uman-level el performance on many tasks (Mnih et al., 2015). 
Deep learning has also significantly improved the performance of reinforcement learning for robotics (Finn et al., 2015).

Many of these applications of deep learning are highly profitable. Deep learning is now used by many top technology companies including Google, Microsoft, Facebook, IBM, Baidu, Apple, Adobe, Netflix, NVIDIA and NEC.

Advances in deep learning have also depended heavily on advances in software infrastructure. Software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012), PyLearn2 (Goodfellow et al., 2013c), Torch (Collobert et al., 2011b), DistBelief (Dean et al., 2012), Caffe (Jia, 2013), MXNet (Chen et al., 2015), and TensorFlow (Abadi et al., 2015) have all supported important research projects or commercial products.
Deep learning has also made contributions back to other sciences. Modern convolutional networks for object recognition provide a model of visual processing
CHAPTER 1. INTRODUCTION
that neuroscientists can study (DiCarlo, 2013). Deep learning also provides useful tools for processing massive amounts of data and making useful predictions in scientific fields. It has been successfully used to predict how molecules will interact in order to help pharmaceutical companies design new drugs (Dahl et al., 2014), to search for subatomic particles (Baldi et al., 2014), and to automatically parse microscope images used to construct a 3-D map of the human brain (Knowles-Barley et al., 2014). We expect deep learning to appear in more and more scientific fields in the future.

In summary, deep learning is an approach to machine learning that has drawn heavily on our knowledge of the human brain, statistics and applied math as it developed over the past several decades. In recent years, it has seen tremendous growth in its popularity and usefulness, due in large part to more powerful computers, larger datasets and techniques to train deeper networks. The years ahead are full of challenges and opportunities to improve deep learning even further and bring it to new frontiers.
Figure 1.11: Increasing neural network size over time. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years. Biological neural network sizes from Wikipedia (2015). (The plot shows number of neurons, on a logarithmic scale from 10^-2 to 10^11, against year, 1950 to 2056, with biological reference points including the sponge, roundworm, leech, ant, bee, frog, octopus and human.)

1. Perceptron (Rosenblatt, 1958, 1962)
2. Adaptive linear element (Widrow and Hoff, 1960)
3. Neocognitron (Fukushima, 1980)
4. Early back-propagation network (Rumelhart et al., 1986b)
5. Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
6. Multilayer perceptron for speech recognition (Bengio et al., 1991)
7. Mean field sigmoid belief network (Saul et al., 1996)
8. LeNet-5 (LeCun et al., 1998b)
9. Echo state network (Jaeger and Haas, 2004)
10. Deep belief network (Hinton et al., 2006)
11. GPU-accelerated convolutional network (Chellapilla et al., 2006)
12. Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
13. GPU-accelerated deep belief network (Raina et al., 2009)
14. Unsupervised convolutional network (Jarrett et al., 2009)
15. GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
16. OMP-1 network (Coates and Ng, 2011)
17. Distributed autoencoder (Le et al., 2012)
18. Multi-GPU convolutional network (Krizhevsky et al., 2012)
19. COTS HPC unsupervised convolutional network (Coates et al., 2013)
20. GoogLeNet (Szegedy et al., 2014a)
Figure 1.12: Decreasing error rate over time. Since deep networks reached the scale necessary to compete in the ImageNet Large Scale Visual Recognition Challenge, they have consistently won the competition every year, and yielded lower and lower error rates each time. Data from Russakovsky et al. (2014b) and He et al. (2015). (The plot shows ILSVRC classification error rate, on a scale from 0.00 to 0.30, against year, 2010 through 2015.)
Part I
Applied Math and Machine Learning Basics
This part of the book introduces the basic mathematical concepts needed to understand deep learning. We begin with general ideas from applied math that allow us to define functions of many variables, find the highest and lowest points on these functions and quantify degrees of belief.

Next, we describe the fundamental goals of machine learning. We describe how to accomplish these goals by specifying a model that represents certain beliefs, designing a cost function that measures how well those beliefs correspond with reality and using a training algorithm to minimize that cost function.

This elementary framework is the basis for a broad variety of machine learning algorithms, including approaches to machine learning that are not deep. In the subsequent parts of the book, we develop deep learning algorithms within this framework.
Chapter 2
Linear Algebra

Linear algebra is a branch of mathematics that is widely used throughout science and engineering. However, because linear algebra is a form of continuous rather than discrete mathematics, many computer scientists have little experience with it. A good understanding of linear algebra is essential for understanding and working with many machine learning algorithms, especially deep learning algorithms. We therefore precede our introduction to deep learning with a focused presentation of the key linear algebra prerequisites.

If you are already familiar with linear algebra, feel free to skip this chapter. If you have previous experience with these concepts but need a detailed reference sheet to review key formulas, we recommend The Matrix Cookbook (Petersen and Pedersen, 2006).
If you have no exposure at all to linear algebra, this chapter will teach you enough to read this book, but we highly recommend that you also consult another resource focused exclusively on teaching linear algebra, such as Shilov (1977). This chapter will completely omit many important linear algebra topics that are not essential for understanding deep learning.
2.1 Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:

• Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lower-case variable names. When we introduce them, we specify what kind of number they are. For
example, we might say "Let s ∈ R be the slope of the line," while defining a real-valued scalar, or "Let n ∈ N be the number of units," while defining a natural number scalar.

• Vectors: A vector is an array of numbers. The numbers are arranged in order. We can identify each individual number by its index in that ordering. Typically we give vectors lower case names written in bold typeface, such as x. The elements of the vector are identified by writing its name in italic typeface, with a subscript. The first element of x is x_1, the second element is x_2 and so on. We also need to say what kind of numbers are stored in the vector. If each element is in R, and the vector has n elements, then the vector lies in the set formed by taking the Cartesian product of R n times, denoted as R^n.
When we need to explicitly identify the elements of a vector, we write them as a column enclosed in square brackets:

        [ x_1 ]
    x = [ x_2 ]    (2.1)
        [  ⋮  ]
        [ x_n ]

We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.

Sometimes we need to index a set of elements of a vector. In this case, we define a set containing the indices and write the set as a subscript. For example, to access x_1, x_3 and x_6, we define the set S = {1, 3, 6} and write x_S. We use the − sign to index the complement of a set. For example x_{−1} is the vector containing all elements of x except for x_1, and x_{−S} is the vector containing all of the elements of x except for x_1, x_3 and x_6.

• Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one. We usually give matrices upper-case variable names with bold typeface, such as A. If a real-valued matrix A has a height
of m and a width of n, then we say that A ∈ R^{m×n}. We usually identify the elements of a matrix using its name in italic but not bold font, and the indices are listed with separating commas. For example, A_{1,1} is the upper left entry of A and A_{m,n} is the bottom right entry. We can identify all the numbers with vertical coordinate i by writing a ":" for the horizontal coordinate. For example, A_{i,:} denotes the horizontal cross section of A with vertical coordinate i. This is known as the i-th row of A. Likewise, A_{:,i} is
        [ A_{1,1}  A_{1,2} ]
    A = [ A_{2,1}  A_{2,2} ]   ⇒   A^T = [ A_{1,1}  A_{2,1}  A_{3,1} ]
        [ A_{3,1}  A_{3,2} ]             [ A_{1,2}  A_{2,2}  A_{3,2} ]

Figure 2.1: The transpose of the matrix can be thought of as a mirror image across the main diagonal.
the i-th column of A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

    [ A_{1,1}  A_{1,2} ]
    [ A_{2,1}  A_{2,2} ]    (2.2)

Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we use subscripts after the expression, but do not convert anything to lower case. For example, f(A)_{i,j} gives element (i, j) of the matrix computed by applying the function f to A.

• Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named "A" with this typeface: A. We identify the element of A at coordinates (i, j, k) by writing A_{i,j,k}.

One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. See Fig. 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix A as A^T, and it is defined such that

    (A^T)_{i,j} = A_{j,i}.    (2.3)

Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a matrix with only one row. Sometimes we
define a vector by writing out its elements in the text inline as a row matrix, then using the transpose operator to turn it into a standard column vector, e.g., x = [x_1, x_2, x_3]^T.

A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own transpose: a = a^T.

We can add matrices to each other, as long as they have the same shape, just by adding their corresponding elements: C = A + B where C_{i,j} = A_{i,j} + B_{i,j}.

We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of a matrix: D = a · B + c where D_{i,j} = a · B_{i,j} + c.

In the context of deep learning, we also use some less conventional notation. We allow the addition of a matrix and a vector, yielding another matrix: C = A + b, where C_{i,j} = A_{i,j} + b_j. In other words, the vector b is added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each row
before doing the addition. This implicit copying of b to many locations is called broadcasting.
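This broadcasting convention is implemented directly by numerical array libraries. A minimal sketch using NumPy (our choice of library for illustration; the values are our own, not from the text):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])    # matrix of shape (2, 3)
b = np.array([10., 20., 30.])   # vector of length 3

# C = A + b: b is implicitly copied into each row of A,
# so C[i, j] = A[i, j] + b[j], matching the shorthand above.
C = A + b
print(C)
# [[11. 22. 33.]
#  [14. 25. 36.]]
```

No copy of b is ever materialized; the library applies the row-wise addition directly.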
2.2 Multiplying Matrices and Vectors
One of the most important operations involving matrices is multiplication of two matrices. The matrix product of matrices A and B is a third matrix C. In order for this product to be defined, A must have the same number of columns as B has rows. If A is of shape m × n and B is of shape n × p, then C is of shape m × p. We can write the matrix product just by placing two or more matrices together, e.g.

    C = AB.    (2.4)

The product operation is defined by
    C_{i,j} = Σ_k A_{i,k} B_{k,j}.    (2.5)

Note that the standard product of two matrices is not just a matrix containing the product of the individual elements. Such an operation exists and is called the element-wise product or Hadamard product, and is denoted as A ⊙ B.

The dot product between two vectors x and y of the same dimensionality is the matrix product x^T y. We can think of the matrix product C = AB as computing C_{i,j} as the dot product between row i of A and column j of B.
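The two products are easy to confuse in code, so a small sketch may help (NumPy, with example matrices of our own choosing; `@` is matrix multiplication and `*` is element-wise):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[5., 6.],
              [7., 8.]])

C = A @ B   # matrix product, Eq. 2.5: C[i, j] = sum_k A[i, k] * B[k, j]
H = A * B   # element-wise (Hadamard) product: a different operation

# C[i, j] is the dot product of row i of A with column j of B.
assert C[0, 1] == A[0, :] @ B[:, 1]

# The dot product of two vectors x and y is the matrix product x^T y.
x = np.array([1., 2., 3.])
y = np.array([4., 5., 6.])
print(x @ y)  # 32.0
```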
Matrix product operations have many useful properties that make mathematical analysis of matrices more convenient. For example, matrix multiplication is distributive:

    A(B + C) = AB + AC.    (2.6)

It is also associative:

    A(BC) = (AB)C.    (2.7)

Matrix multiplication is not commutative (the condition AB = BA does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:

    x^T y = y^T x.    (2.8)

The transpose of a matrix product has a simple form:
    (AB)^T = B^T A^T.    (2.9)

This allows us to demonstrate Eq. 2.8, by exploiting the fact that the value of such a product is a scalar and therefore equal to its own transpose:
    x^T y = (x^T y)^T = y^T x.    (2.10)

Since the focus of this textbook is not linear algebra, we do not attempt to develop a comprehensive list of useful properties of the matrix product here, but the reader should be aware that many more exist.

We now know enough linear algebra to write down a system of linear equations:

    Ax = b    (2.11)

where A ∈ R^{m×n} is a known matrix, b ∈ R^m is a known vector, and x ∈ R^n is a vector of unknown variables we would like to solve for. Each element x_i of x is one of these unknown variables. Each row of A and each element of b provide another constraint. We can rewrite Eq. 2.11 as:

    A_{1,:} x = b_1    (2.12)
    A_{2,:} x = b_2    (2.13)
            ...        (2.14)
    A_{m,:} x = b_m    (2.15)

or, even more explicitly, as:

    A_{1,1} x_1 + A_{1,2} x_2 + ... + A_{1,n} x_n = b_1    (2.16)
    [ 1  0  0 ]
    [ 0  1  0 ]
    [ 0  0  1 ]

Figure 2.2: Example identity matrix: This is I_3.
    A_{2,1} x_1 + A_{2,2} x_2 + ... + A_{2,n} x_n = b_2    (2.17)
            ...                                            (2.18)
    A_{m,1} x_1 + A_{m,2} x_2 + ... + A_{m,n} x_n = b_m.   (2.19)

Matrix-vector product notation provides a more compact representation for equations of this form.
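The correspondence between the compact form Ax = b and the row-by-row constraints of Eqs. 2.12-2.15 can be checked numerically. A brief sketch (a small system of our own invention, in NumPy):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])   # known matrix
x = np.array([1., 2.])     # pretend the unknowns are known, so we can check
b = A @ x                  # then b = [4., 7.]

# Each row of A together with one element of b is one constraint:
for i in range(A.shape[0]):
    assert A[i, :] @ x == b[i]   # A_{i,:} x = b_i
```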
2.3 Identity and Inverse Matrices
Linear algebra offers a powerful tool called matrix inversion that allows us to analytically solve Eq. 2.11 for many values of A.

To describe matrix inversion, we first need to define the concept of an identity matrix. An identity matrix is a matrix that does not change any vector when we multiply that vector by that matrix. We denote the identity matrix that preserves n-dimensional vectors as I_n. Formally, I_n ∈ R^{n×n}, and

    ∀x ∈ R^n, I_n x = x.    (2.20)

The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero. See Fig. 2.2 for an example.

The matrix inverse of A is denoted as A^{-1}, and it is defined as the matrix such that

    A^{-1} A = I_n.    (2.21)

We can now solve Eq. 2.11 by the following steps:

    Ax = b                  (2.22)
    A^{-1} Ax = A^{-1} b    (2.23)
    I_n x = A^{-1} b        (2.24)
    x = A^{-1} b.    (2.25)

Of course, this depends on it being possible to find A^{-1}. We discuss the conditions for the existence of A^{-1} in the following section.

When A^{-1} exists, several different algorithms exist for finding it in closed form. In theory, the same inverse matrix can then be used to solve the equation many times for different values of b. However, A^{-1} is primarily useful as a theoretical tool, and should not actually be used in practice for most software applications. Because A^{-1} can be represented with only limited precision on a digital computer, algorithms that make use of the value of b can usually obtain more accurate estimates of x.
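This advice maps directly onto standard library routines: a solver that works from A and b together is preferred over explicitly forming A^{-1}. A sketch in NumPy (the example system is our own):

```python
import numpy as np

A = np.array([[3., 1.],
              [1., 2.]])
b = np.array([9., 8.])

# Theoretical route, Eq. 2.25: x = A^{-1} b
x_via_inverse = np.linalg.inv(A) @ b

# Practical route: solve Ax = b directly, without forming A^{-1}.
# This is generally the more accurate choice in finite precision.
x_via_solve = np.linalg.solve(A, b)

assert np.allclose(x_via_inverse, x_via_solve)
assert np.allclose(A @ x_via_solve, b)
print(x_via_solve)   # [2. 3.]
```

For this small well-conditioned system both routes agree; the accuracy gap favoring the direct solver grows as A becomes ill-conditioned.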
2.4
Linear Dependence and Span
In order for A^{-1} to exist, Eq. 2.11 must have exactly one solution for every value of b. However, it is also possible for the system of equations to have no solutions or infinitely many solutions for some values of b. It is not possible to have more than one but less than infinitely many solutions for a particular b; if both x and y are solutions then

z = αx + (1 − α)y    (2.26)

is also a solution for any real α.

To analyze how many solutions the equation has, we can think of the columns of A as specifying different directions we can travel from the origin (the point specified by the vector of all zeros), and determine how many ways there are of reaching b. In this view, each element of x specifies how far we should travel in each of these directions, with x_i specifying how far to move in the direction of column i:

Ax = Σ_i x_i A_{:,i}.    (2.27)

In general, this kind of operation is called a linear combination. Formally, a linear combination of some set of vectors {v^{(1)}, ..., v^{(n)}} is given by multiplying each vector v^{(i)} by a corresponding scalar coefficient and adding the results:

Σ_i c_i v^{(i)}.    (2.28)

The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors.
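The column-combination view of Eq. 2.27 can be verified directly; this NumPy sketch uses an arbitrary illustrative matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.0]])   # columns are the "directions" we can travel
x = np.array([2.0, -1.0])    # how far to travel in each direction

# Eq. 2.27: Ax equals the columns of A weighted by the entries of x.
combo = x[0] * A[:, 0] + x[1] * A[:, 1]
assert np.allclose(A @ x, combo)
```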
Determining whether Ax = b has a solution thus amounts to testing whether b is in the span of the columns of A. This particular span is known as the column space or the range of A.

In order for the system Ax = b to have a solution for all values of b ∈ R^m, we therefore require that the column space of A be all of R^m. If any point in R^m is excluded from the column space, that point is a potential value of b that has no solution. The requirement that the column space of A be all of R^m immediately implies that A must have at least m columns, i.e., n ≥ m. Otherwise, the dimensionality of the column space would be less than m. For example, consider a 3 × 2 matrix. The target b is 3-D, but x is only 2-D, so modifying the value of x at best allows us to trace out a 2-D plane within R^3. The equation has a solution if and only if b lies on that plane.

Having n ≥ m is only a necessary condition for every point to have a solution. It is not a sufficient condition, because it is possible for some of the columns to be redundant. Consider a 2 × 2 matrix where both of the columns are equal to each other. This has the same column space as a 2 × 1 matrix containing only one copy of the replicated column. In other words, the column space is still just a line, and fails to encompass all of R^2, even though there are two columns.

Formally, this kind of redundancy is known as linear dependence. A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors. If we add a vector to a set that is a linear combination of the other vectors in the set, the new vector does not add any points to the set's span. This means that for the column space of the matrix to encompass all of R^m, the matrix must contain at least one set of m linearly independent columns. This condition is both necessary and sufficient for Eq. 2.11 to have a solution for every value of b.

Note that the requirement is for a set to have exactly m linearly independent columns, not at least m. No set of m-dimensional vectors can have more than m mutually linearly independent columns, but a matrix with more than m columns may have more than one such set.

In order for the matrix to have an inverse, we additionally need to ensure that Eq. 2.11 has at most one solution for each value of b. To do so, we need to ensure that the matrix has at most m columns. Otherwise there is more than one way of parametrizing each solution.

Together, this means that the matrix must be square, that is, we require that m = n and that all of the columns must be linearly independent. A square matrix with linearly dependent columns is known as singular.

If A is not square or is square but singular, it can still be possible to solve the equation. However, we cannot use the method of matrix inversion to find the solution.

So far we have discussed matrix inverses as being multiplied on the left. It is also possible to define an inverse that is multiplied on the right:

AA^{-1} = I.    (2.29)

For square matrices, the left inverse and right inverse are equal.
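The singularity condition can be checked numerically. In this NumPy sketch (with made-up matrices), a square matrix with linearly dependent columns has rank below m, and attempting to invert it fails:

```python
import numpy as np

singular = np.array([[1.0, 1.0],
                     [2.0, 2.0]])   # columns are linearly dependent
invertible = np.array([[1.0, 0.0],
                       [0.0, 2.0]])

# Rank below m = 2 means the column space misses part of R^2.
assert np.linalg.matrix_rank(singular) == 1
assert np.linalg.matrix_rank(invertible) == 2

# A singular matrix has no inverse; NumPy raises an error.
try:
    np.linalg.inv(singular)
except np.linalg.LinAlgError:
    print("singular matrix has no inverse")
```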
2.5
Norms
Sometimes we need to measure the size of a vector. In machine learning, we usually measure the size of vectors using a function called a norm. Formally, the L^p norm is given by

||x||_p = (Σ_i |x_i|^p)^{1/p}    (2.30)

for p ∈ R, p ≥ 1.

Norms, including the L^p norm, are functions mapping vectors to non-negative values. On an intuitive level, the norm of a vector x measures the distance from the origin to the point x. More rigorously, a norm is any function f that satisfies the following properties:

• f(x) = 0 ⇒ x = 0
• f(x + y) ≤ f(x) + f(y) (the triangle inequality)
• ∀α ∈ R, f(αx) = |α| f(x)

The L^2 norm, with p = 2, is known as the Euclidean norm. It is simply the Euclidean distance from the origin to the point identified by x. The L^2 norm is used so frequently in machine learning that it is often denoted simply as ||x||, with the subscript 2 omitted. It is also common to measure the size of a vector using the squared L^2 norm, which can be calculated simply as x^⊤x.

The squared L^2 norm is more convenient to work with mathematically and computationally than the L^2 norm itself. For example, the derivatives of the squared L^2 norm with respect to each element of x each depend only on the corresponding element of x, while all of the derivatives of the L^2 norm depend on the entire vector. In many contexts, the squared L^2 norm may be undesirable because it increases very slowly near the origin. In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. In these cases, we turn to a function that grows at the same rate in all locations, but retains mathematical simplicity: the L^1 norm. The L^1 norm may be simplified to

||x||_1 = Σ_i |x_i|.    (2.31)

The L^1 norm is commonly used in machine learning when the difference between zero and nonzero elements is very important. Every time an element of x moves away from 0 by ε, the L^1 norm increases by ε.

We sometimes measure the size of the vector by counting its number of nonzero elements. Some authors refer to this function as the "L^0 norm," but this is incorrect terminology. The number of non-zero entries in a vector is not a norm, because scaling the vector by α does not change the number of nonzero entries. The L^1 norm is often used as a substitute for the number of nonzero entries.

One other norm that commonly arises in machine learning is the L^∞ norm, also known as the max norm. This norm simplifies to the absolute value of the element with the largest magnitude in the vector,

||x||_∞ = max_i |x_i|.    (2.32)

Sometimes we may also wish to measure the size of a matrix. In the context of deep learning, the most common way to do this is with the otherwise obscure Frobenius norm

||A||_F = (Σ_{i,j} A_{i,j}^2)^{1/2},    (2.33)

which is analogous to the L^2 norm of a vector.

The dot product of two vectors can be rewritten in terms of norms. Specifically,

x^⊤y = ||x||_2 ||y||_2 cos θ,    (2.34)

where θ is the angle between x and y.
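All of the norms above are exposed by NumPy's `np.linalg.norm` (the `ord` argument selects p); this sketch, with illustrative vectors, checks Eqs. 2.30 through 2.34:

```python
import numpy as np

x = np.array([3.0, -4.0])

l2 = np.linalg.norm(x)                # Eq. 2.30 with p = 2: sqrt(9 + 16) = 5
l1 = np.linalg.norm(x, ord=1)         # Eq. 2.31: |3| + |-4| = 7
linf = np.linalg.norm(x, ord=np.inf)  # Eq. 2.32: max(|3|, |-4|) = 4
assert l2 == 5.0 and l1 == 7.0 and linf == 4.0

# The squared L^2 norm is simply x^T x.
assert np.isclose(x @ x, l2 ** 2)

# The Frobenius norm (Eq. 2.33) is NumPy's default matrix norm.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
assert np.isclose(np.linalg.norm(A), np.sqrt(1 + 4 + 9 + 16))

# Eq. 2.34: the dot product in terms of norms and the angle between vectors.
y = np.array([4.0, 3.0])
cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(x @ y, np.linalg.norm(x) * np.linalg.norm(y) * cos_theta)
```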
2.6
Special Kinds of Matrices and Vectors

Some special kinds of matrices and vectors are particularly useful.
Diagonal matrices consist mostly of zeros and have non-zero entries only along the main diagonal. Formally, a matrix D is diagonal if and only if D_{i,j} = 0 for all i ≠ j. We have already seen one example of a diagonal matrix: the identity matrix, where all of the diagonal entries are 1. We write diag(v) to denote a square diagonal matrix whose diagonal entries are given by the entries of the vector v. Diagonal matrices are of interest in part because multiplying by a diagonal matrix is very computationally efficient. To compute diag(v)x, we only need to scale each element x_i by v_i. In other words, diag(v)x = v ⊙ x. Inverting a square diagonal matrix is also efficient. The inverse exists only if every diagonal entry is nonzero, and in that case, diag(v)^{-1} = diag([1/v_1, ..., 1/v_n]^⊤). In many cases, we may derive some very general machine learning algorithm in terms of arbitrary matrices, but obtain a less expensive (and less descriptive) algorithm by restricting some matrices to be diagonal.

Not all diagonal matrices need be square. It is possible to construct a rectangular diagonal matrix. Non-square diagonal matrices do not have inverses but it is still possible to multiply by them cheaply. For a non-square diagonal matrix D, the product Dx will involve scaling each element of x, and either concatenating some zeros to the result if D is taller than it is wide, or discarding some of the last elements of the vector if D is wider than it is tall.

A symmetric matrix is any matrix that is equal to its own transpose:

A = A^⊤.    (2.35)

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, if A is a matrix of distance measurements, with A_{i,j} giving the distance from point i to point j, then A_{i,j} = A_{j,i} because distance functions are symmetric.

A unit vector is a vector with unit norm:

||x||_2 = 1.    (2.36)

A vector x and a vector y are orthogonal to each other if x^⊤y = 0. If both vectors have nonzero norm, this means that they are at a 90 degree angle to each other. In R^n, at most n vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal.

An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

A^⊤A = AA^⊤ = I.    (2.37)
This implies that

A^{-1} = A^⊤,    (2.38)

so orthogonal matrices are of interest because their inverse is very cheap to compute. Pay careful attention to the definition of orthogonal matrices. Counterintuitively, their rows are not merely orthogonal but fully orthonormal. There is no special term for a matrix whose rows or columns are orthogonal but not orthonormal.
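The special matrices of this section are easy to experiment with numerically. This NumPy sketch (illustrative values only) checks the diagonal, symmetric, and orthogonal properties above:

```python
import numpy as np

# Diagonal: multiplying by diag(v) is just elementwise scaling,
# and its inverse reciprocates the diagonal entries.
v = np.array([2.0, 4.0, 5.0])
x = np.array([1.0, -1.0, 3.0])
assert np.allclose(np.diag(v) @ x, v * x)
assert np.allclose(np.linalg.inv(np.diag(v)), np.diag(1.0 / v))

# Symmetric: a pairwise-distance matrix satisfies A = A^T (Eq. 2.35).
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
assert np.allclose(D, D.T)

# Orthogonal: a rotation matrix satisfies Q^T Q = Q Q^T = I (Eq. 2.37),
# so its inverse is just its transpose (Eq. 2.38).
theta = 0.3  # an arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(Q.T @ Q, np.eye(2)) and np.allclose(Q @ Q.T, np.eye(2))
assert np.allclose(np.linalg.inv(Q), Q.T)
```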
2.7
Eigendecomposition
Many mathematical objects can be understood better by breaking them into constituent parts, or finding some properties of them that are universal, not caused by the way we choose to represent them.

For example, integers can be decomposed into prime factors. The way we represent the number 12 will change depending on whether we write it in base ten or in binary, but it will always be true that 12 = 2 × 2 × 3. From this representation we can conclude useful properties, such as that 12 is not divisible by 5, or that any integer multiple of 12 will be divisible by 3.

Much as we can discover something about the true nature of an integer by decomposing it into prime factors, we can also decompose matrices in ways that show us information about their functional properties that is not obvious from the representation of the matrix as an array of elements.

One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

An eigenvector of a square matrix A is a non-zero vector v such that multiplication by A alters only the scale of v:

Av = λv.    (2.39)

The scalar λ is known as the eigenvalue corresponding to this eigenvector. (One can also find a left eigenvector such that v^⊤A = λv^⊤, but we are usually concerned with right eigenvectors).

If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ R, s ≠ 0. Moreover, sv still has the same eigenvalue. For this reason, we usually only look for unit eigenvectors.

Suppose that a matrix A has n linearly independent eigenvectors, {v^{(1)}, ..., v^{(n)}}, with corresponding eigenvalues {λ_1, ..., λ_n}. We may concatenate all of the
[Figure 2.3 omitted: two panels, "Before multiplication" (axes x_0, x_1) and "After multiplication" (axes x'_0, x'_1), showing the unit circle and its image, with the eigenvectors v^{(1)} and v^{(2)} scaled to λ_1 v^{(1)} and λ_2 v^{(2)}.]

Figure 2.3: An example of the effect of eigenvectors and eigenvalues. Here, we have a matrix A with two orthonormal eigenvectors, v^{(1)} with eigenvalue λ_1 and v^{(2)} with eigenvalue λ_2. (Left) We plot the set of all unit vectors u ∈ R^2 as a unit circle. (Right) We plot the set of all points Au. By observing the way that A distorts the unit circle, we can see that it scales space in direction v^{(i)} by λ_i.
eigenvectors to form a matrix V with one eigenvector per column: V = [v^{(1)}, ..., v^{(n)}]. Likewise, we can concatenate the eigenvalues to form a vector λ = [λ_1, ..., λ_n]^⊤. The eigendecomposition of A is then given by

A = V diag(λ) V^{-1}.    (2.40)

We have seen that constructing matrices with specific eigenvalues and eigenvectors allows us to stretch space in desired directions. However, we often want to decompose matrices into their eigenvalues and eigenvectors. Doing so can help us to analyze certain properties of the matrix, much as decomposing an integer into its prime factors can help us understand the behavior of that integer.

Not every matrix can be decomposed into eigenvalues and eigenvectors. In some
cases, the decomposition exists, but may involve complex rather than real numbers. Fortunately, in this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

A = QΛQ^⊤,    (2.41)

where Q is an orthogonal matrix composed of eigenvectors of A, and Λ is a diagonal matrix. The eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q, denoted as Q_{:,i}. Because Q is an orthogonal matrix, we can think of A as scaling space by λ_i in direction v^{(i)}. See Fig. 2.3 for an example.

While any real symmetric matrix A is guaranteed to have an eigendecomposition, the eigendecomposition may not be unique. If any two or more eigenvectors share the same eigenvalue, then any set of orthogonal vectors lying in their span are also eigenvectors with that eigenvalue, and we could equivalently choose a Q using those eigenvectors instead. By convention, we usually sort the entries of Λ in descending order. Under this convention, the eigendecomposition is unique only if all of the eigenvalues are unique.

The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is singular if and only if any of the eigenvalues are 0. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form f(x) = x^⊤Ax subject to ||x||_2 = 1. Whenever x is equal to an eigenvector of A, f takes on the value of the corresponding eigenvalue. The maximum value of f within the constraint region is the maximum eigenvalue and its minimum value within the constraint region is the minimum eigenvalue.
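For real symmetric matrices, the decomposition of Eq. 2.41 can be computed with NumPy's `np.linalg.eigh` (which returns eigenvalues in ascending rather than our descending convention); a sketch with an illustrative matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # real symmetric, so Eq. 2.41 applies

eigvals, Q = np.linalg.eigh(A)  # eigh is specialized to symmetric matrices

# Eq. 2.41: A = Q Lambda Q^T, with Q orthogonal.
assert np.allclose(Q @ np.diag(eigvals) @ Q.T, A)
assert np.allclose(Q.T @ Q, np.eye(2))

# The eigenvalues here are 1 and 3; both positive, so A is positive definite.
assert np.allclose(eigvals, [1.0, 3.0])

# x^T A x over unit vectors stays between the min and max eigenvalue.
rng = np.random.default_rng(0)
u = rng.standard_normal(2)
u /= np.linalg.norm(u)
assert eigvals[0] - 1e-9 <= u @ A @ u <= eigvals[-1] + 1e-9
```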
A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero-valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero-valued, it is negative semidefinite. Positive semidefinite matrices are interesting because they guarantee that ∀x, x^T A x ≥ 0. Positive definite matrices additionally guarantee that x^T A x = 0 ⇒ x = 0.

2.8 Singular Value Decomposition

In Sec. 2.7, we saw how to decompose a matrix into eigenvectors and eigenvalues. The singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD allows us to discover some of the same kind of information as the eigendecomposition.
However, the SVD is
CHAPTER 2. LINEAR ALGEBRA
more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. For example, if a matrix is not square, the eigendecomposition is not defined, and we must use a singular value decomposition instead.

Recall that the eigendecomposition involves analyzing a matrix A to discover a matrix V of eigenvectors and a vector of eigenvalues λ such that we can rewrite A as

A = V diag(λ) V^{-1}.  (2.42)

The singular value decomposition is similar, except this time we will write A as a product of three matrices:

A = U D V^T.  (2.43)

Suppose that A is an m × n matrix. Then U is defined to be an m × m matrix, D to be an m × n matrix, and V to be an n × n matrix.

Each of these matrices is defined to have a special structure. The matrices U and V are both defined to be orthogonal matrices. The matrix D is defined to be
The matrix D is defined to be The elemen elements ts along the diagonal of D are kno known wn as the singular values of the a diagonal matrix. Note that D is not necessarily square. matrix A. The columns of U are kno known wn as the left-singular ve vectors ctors ctors.. The columns D The elemen ts along the diagonal of are kno wn as the singular values of the of V are kno known wn as as the right-singular ve vectors ctors ctors.. matrix A. The columns of U are known as the left-singular vectors. The columns e can actually the singular value decomposition osition of A in terms of of VWare kno wn as asinterpret the right-singular vectors . decomp the eigendecomposition of functions of A . The left-singular vectors of A are the A in terms Wveectors can actually the singular valueofdecomp osition of of. A are the eigen eigenv of AA> .interpret The righ right-singular t-singular vectors eigen eigenvectors vectors of A>A A A the eigendecomposition of functions of . The left-singular v ectors of are the The non-zero singular values of A are the square ro roots ots of the eigen eigenv values of A>A. AA A eigen v ectors of . The righ t-singular v ectors of are the eigen vectors of A A. > The same is true for AA . The non-zero singular values of A are the square roots of the eigenvalues of A A. Perhaps the most useful feature of the SVD is that we can use it to partially The same is true for AA . generalize matrix in inversion version to non-square matrices, as we will see in the next P erhaps the most useful feature of the SVD is that we can use it to partially section. generalize matrix inversion to non-square matrices, as we will see in the next section.
2.9 The Moore-Penrose Pseudoinverse
Matrix inversion is not defined for matrices that are not square. Suppose we want to make a left-inverse B of a matrix A, so that we can solve a linear equation

Ax = y  (2.44)
by left-multiplying each side to obtain

x = By.  (2.45)
Depending on the structure of the problem, it may not be possible to design a unique mapping from A to B.

If A is taller than it is wide, then it is possible for this equation to have no solution. If A is wider than it is tall, then there could be multiple possible solutions.

The Moore-Penrose pseudoinverse allows us to make some headway in these cases. The pseudoinverse of A is defined as a matrix

A^+ = lim_{α → 0+} (A^T A + αI)^{-1} A^T.  (2.46)
Practical algorithms for computing the pseudoinverse are not based on this definition, but rather the formula

A^+ = V D^+ U^T,  (2.47)

where U, D and V are the singular value decomposition of A, and the pseudoinverse D^+ of a diagonal matrix D is obtained by taking the reciprocal of its non-zero elements, then taking the transpose of the resulting matrix.

When A has more columns than rows, then solving a linear equation using the pseudoinverse provides one of the many possible solutions. Specifically, it provides the solution x = A^+ y with minimal Euclidean norm ||x||_2 among all possible solutions.

When A has more rows than columns, it is possible for there to be no solution. In this case, using the pseudoinverse gives us the x for which Ax is as close as possible to y in terms of Euclidean norm ||Ax − y||_2.
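The SVD formula of Eq. 2.47 and the minimal-norm property can both be checked numerically. This sketch is not from the book; it assumes NumPy, and the wide matrix A and target y are arbitrary examples:

```python
import numpy as np

rng = np.random.default_rng(2)

# A wide matrix: more columns than rows, so Ax = y has many solutions.
A = rng.standard_normal((3, 5))
y = rng.standard_normal(3)

# The SVD formula A+ = V D+ U^T (Eq. 2.47) matches the library pseudoinverse.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# x = A+ y solves the system exactly (A has full row rank here) ...
x = A_pinv @ y
assert np.allclose(A @ x, y)

# ... and has minimal Euclidean norm: shifting x along a null-space direction
# of A yields another valid solution with a strictly larger norm.
_, _, Vt_full = np.linalg.svd(A, full_matrices=True)
null_dir = Vt_full[3]               # unit vector with A @ null_dir ~ 0
x_alt = x + 0.5 * null_dir
assert np.allclose(A @ x_alt, y)
assert np.linalg.norm(x_alt) > np.linalg.norm(x)
```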
2.10 The Trace Operator

The trace operator gives the sum of all of the diagonal entries of a matrix:

Tr(A) = Σ_i A_{i,i}.  (2.48)
The trace operator is useful for a variety of reasons. Some operations that are difficult to specify without resorting to summation notation can be specified using
matrix products and the trace operator. For example, the trace operator provides an alternative way of writing the Frobenius norm of a matrix:

||A||_F = sqrt(Tr(A A^T)).  (2.49)
Writing an expression in terms of the trace operator opens up opportunities to manipulate the expression using many useful identities. For example, the trace operator is invariant to the transpose operator:

Tr(A) = Tr(A^T).  (2.50)
The trace of a square matrix composed of many factors is also invariant to moving the last factor into the first position, if the shapes of the corresponding matrices allow the resulting product to be defined:

Tr(ABC) = Tr(CAB) = Tr(BCA)  (2.51)

or more generally,
Tr(∏_{i=1}^{n} F^{(i)}) = Tr(F^{(n)} ∏_{i=1}^{n-1} F^{(i)}).  (2.52)
This invariance to cyclic permutation holds even if the resulting product has a different shape. For example, for A ∈ R^{m×n} and B ∈ R^{n×m}, we have

Tr(AB) = Tr(BA)  (2.53)

even though AB ∈ R^{m×m} and BA ∈ R^{n×n}.

Another useful fact to keep in mind is that a scalar is its own trace: a = Tr(a).
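The identities in Eqs. 2.49, 2.50, and 2.53 can be confirmed on small examples. As before, this is only a sketch under assumed tooling (NumPy); the matrices are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 4))

# Frobenius norm via the trace operator (Eq. 2.49).
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A @ A.T)))

# Trace is invariant to transposition of a square matrix (Eq. 2.50).
M = rng.standard_normal((5, 5))
assert np.isclose(np.trace(M), np.trace(M.T))

# Cyclic permutation: Tr(AB) = Tr(BA), even though AB is 4x4
# while BA is 6x6 (Eq. 2.53).
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
```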
2.11 The Determinant
The determinant of a square matrix, denoted det(A), is a function mapping matrices to real scalars. The determinant is equal to the product of all the eigenvalues of the matrix. The absolute value of the determinant can be thought of as a measure of how much multiplication by the matrix expands or contracts space. If the determinant is 0, then space is contracted completely along at least one dimension, causing it to lose all of its volume. If the determinant is 1, then the transformation is volume-preserving.
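The claim that the determinant equals the product of the eigenvalues is easy to spot-check. This sketch assumes NumPy and an arbitrary example matrix; note that a general real matrix may have complex eigenvalues, though their product is still real up to floating-point noise:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))

# det(A) equals the product of the eigenvalues (taking the real part,
# since complex eigenvalues come in conjugate pairs for a real matrix).
eigvals = np.linalg.eigvals(A)
assert np.isclose(np.linalg.det(A), np.prod(eigvals).real)

# An orthogonal matrix has |det| = 1: it preserves volume.
Q, _ = np.linalg.qr(A)
assert np.isclose(abs(np.linalg.det(Q)), 1.0)
```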
2.12 Example: Principal Components Analysis
One simple machine learning algorithm, principal components analysis or PCA, can be derived using only knowledge of basic linear algebra.

Suppose we have a collection of m points {x^{(1)}, ..., x^{(m)}} in R^n. Suppose we would like to apply lossy compression to these points. Lossy compression means storing the points in a way that requires less memory but may lose some precision. We would like to lose as little precision as possible.

One way we can encode these points is to represent a lower-dimensional version of them. For each point x^{(i)} ∈ R^n we will find a corresponding code vector c^{(i)} ∈ R^l. If l is smaller than n, it will take less memory to store the code points than the original data. We will want to find some encoding function that produces the code for an input, f(x) = c, and a decoding function that produces the reconstructed input given its code, x ≈ g(f(x)).
PCA is defined by our choice of the decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back into R^n. Let g(c) = Dc, where D ∈ R^{n×l} is the matrix defining the decoding.

Computing the optimal code for this decoder could be a difficult problem. To keep the encoding problem easy, PCA constrains the columns of D to be orthogonal to each other. (Note that D is still not technically "an orthogonal matrix" unless l = n.)

With the problem as described so far, many solutions are possible, because we can increase the scale of D_{:,i} if we decrease c_i proportionally for all points. To give the problem a unique solution, we constrain all of the columns of D to have unit norm.

In order to turn this basic idea into an algorithm we can implement, the first
thing we need to do is figure out how to generate the optimal code point c* for each input point x. One way to do this is to minimize the distance between the input point x and its reconstruction, g(c*). We can measure this distance using a norm. In the principal components algorithm, we use the L^2 norm:

c* = arg min_c ||x − g(c)||_2.  (2.54)
We can switch to the squared L^2 norm instead of the L^2 norm itself, because both are minimized by the same value of c. This is because the L^2 norm is non-negative and the squaring operation is monotonically increasing for non-negative
arguments.

c* = arg min_c ||x − g(c)||_2^2.  (2.55)

The function being minimized simplifies to

(x − g(c))^T (x − g(c))  (2.56)

(by the definition of the L^2 norm, Eq. 2.30)

= x^T x − x^T g(c) − g(c)^T x + g(c)^T g(c)  (2.57)

(by the distributive property)

= x^T x − 2 x^T g(c) + g(c)^T g(c)  (2.58)

(because the scalar g(c)^T x is equal to the transpose of itself).

We can now change the function being minimized again, to omit the first term, since this term does not depend on c:

c* = arg min_c −2 x^T g(c) + g(c)^T g(c).  (2.59)
To make further progress, we must substitute in the definition of g(c):

c* = arg min_c −2 x^T D c + c^T D^T D c  (2.60)

= arg min_c −2 x^T D c + c^T I_l c  (2.61)

(by the orthogonality and unit norm constraints on D)

= arg min_c −2 x^T D c + c^T c.  (2.62)

We can solve this optimization problem using vector calculus (see Sec. 4.3 if you do not know how to do this):

∇_c (−2 x^T D c + c^T c) = 0  (2.63)

−2 D^T x + 2c = 0  (2.64)

c = D^T x.  (2.65)

This makes the algorithm efficient: we can optimally encode x using just a matrix-vector operation. To encode a vector, we apply the encoder function

f(x) = D^T x.  (2.66)
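The closed-form code of Eq. 2.65 can be cross-checked against a generic least-squares solver: when D has orthonormal columns, the minimizer of ||x − Dc||_2 is exactly D^T x. This sketch assumes NumPy; D and x are arbitrary examples, not from the book:

```python
import numpy as np

rng = np.random.default_rng(5)

# A decoder D in R^{n x l} with orthonormal columns, built via a reduced QR
# factorization of a random matrix.
n, l = 6, 2
D, _ = np.linalg.qr(rng.standard_normal((n, l)))
x = rng.standard_normal(n)

# Closed form from Eq. 2.65: the optimal code is c* = D^T x.
c_closed = D.T @ x

# Generic least-squares solve of min_c ||x - Dc||_2 gives the same answer,
# because D^T D = I_l makes the normal-equations solution collapse to D^T x.
c_lstsq, *_ = np.linalg.lstsq(D, x, rcond=None)
assert np.allclose(c_closed, c_lstsq)
```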
Using a further matrix multiplication, we can also define the PCA reconstruction operation:

r(x) = g(f(x)) = D D^T x.  (2.67)

Next, we need to choose the encoding matrix D. To do so, we revisit the idea of minimizing the L^2 distance between inputs and reconstructions. However, since we will use the same matrix D to decode all of the points, we can no longer consider the points in isolation. Instead, we must minimize the Frobenius norm of the matrix of errors computed over all dimensions and all points:

D* = arg min_D sqrt( Σ_{i,j} ( x_j^{(i)} − r(x^{(i)})_j )^2 )  subject to D^T D = I_l.  (2.68)

To derive the algorithm for finding D*, we will start by considering the case where l = 1. In this case, D is just a single vector, d. Substituting Eq. 2.67 into Eq. 2.68 and simplifying D into d, the problem reduces to

d* = arg min_d Σ_i ||x^{(i)} − d d^T x^{(i)}||_2^2  subject to ||d||_2 = 1.  (2.69)

The above formulation is the most direct way of performing the substitution, but is not the most stylistically pleasing way to write the equation. It places the scalar value d^T x^{(i)} on the right of the vector d. It is more conventional to write scalar coefficients on the left of the vector they operate on. We therefore usually write such a formula as

d* = arg min_d Σ_i ||x^{(i)} − d^T x^{(i)} d||_2^2  subject to ||d||_2 = 1,  (2.70)

or, exploiting the fact that a scalar is its own transpose, as

d* = arg min_d Σ_i ||x^{(i)} − x^{(i)T} d d||_2^2  subject to ||d||_2 = 1.  (2.71)

The reader should aim to become familiar with such cosmetic rearrangements.

At this point, it can be helpful to rewrite the problem in terms of a single design matrix of examples, rather than as a sum over separate example vectors. This will allow us to use more compact notation. Let X ∈ R^{m×n} be the matrix defined by stacking all of the vectors describing the points, such that X_{i,:} = x^{(i)T}. We can now rewrite the problem as

d* = arg min_d ||X − X d d^T||_F^2  subject to d^T d = 1.  (2.72)
Disregarding the constraint for the moment, we can simplify the Frobenius norm portion as follows:

arg min_d ||X − X d d^T||_F^2  (2.73)

= arg min_d Tr( (X − X d d^T)^T (X − X d d^T) )  (2.74)

(by Eq. 2.49)

= arg min_d Tr( X^T X − X^T X d d^T − d d^T X^T X + d d^T X^T X d d^T )  (2.75)

= arg min_d Tr(X^T X) − Tr(X^T X d d^T) − Tr(d d^T X^T X) + Tr(d d^T X^T X d d^T)  (2.76)

= arg min_d −Tr(X^T X d d^T) − Tr(d d^T X^T X) + Tr(d d^T X^T X d d^T)  (2.77)

(because terms not involving d do not affect the arg min)

= arg min_d −2 Tr(X^T X d d^T) + Tr(d d^T X^T X d d^T)  (2.78)

(because we can cycle the order of the matrices inside a trace, Eq. 2.52)

= arg min_d −2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T)  (2.79)

(using the same property again)

At this point, we re-introduce the constraint:

arg min_d −2 Tr(X^T X d d^T) + Tr(X^T X d d^T d d^T)  subject to d^T d = 1  (2.80)

= arg min_d −2 Tr(X^T X d d^T) + Tr(X^T X d d^T)  subject to d^T d = 1  (2.81)

(due to the constraint)

= arg min_d −Tr(X^T X d d^T)  subject to d^T d = 1  (2.82)

= arg max_d Tr(X^T X d d^T)  subject to d^T d = 1  (2.83)

= arg max_d Tr(d^T X^T X d)  subject to d^T d = 1  (2.84)
This optimization problem may be solved using eigendecomposition. Specifically, the optimal d is given by the eigenvector of X^T X corresponding to the largest eigenvalue.

In the general case, where l > 1, the matrix D is given by the l eigenvectors corresponding to the largest eigenvalues. This may be shown using proof by induction. We recommend writing this proof as an exercise.

Linear algebra is one of the fundamental mathematical disciplines that is necessary to understand deep learning. Another key area of mathematics that is ubiquitous in machine learning is probability theory, presented next.
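Putting the derivation together, the following sketch (not from the book; it assumes NumPy, and the synthetic data matrix X is an arbitrary example) implements PCA exactly as derived here: D is taken to be the l eigenvectors of X^T X with the largest eigenvalues, encoding is f(x) = D^T x and reconstruction is r(x) = D D^T x. Following the book's formulation, the data are not mean-centered. It also checks that this choice of D reconstructs better than a basis of the smallest eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(6)

# m points in R^n, with most of the variance concentrated in two directions.
m, n, l = 200, 5, 2
X = rng.standard_normal((m, n)) * np.array([5.0, 3.0, 0.5, 0.2, 0.1])

# D = the l eigenvectors of X^T X with the largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
D = eigvecs[:, -l:]                          # columns for the top-l eigenvalues
assert np.allclose(D.T @ D, np.eye(l))       # the constraint D^T D = I_l holds

# Encode every point (rows of X), then reconstruct (Eqs. 2.66-2.67).
codes = X @ D                # row i is f(x^{(i)}) = D^T x^{(i)}
recons = codes @ D.T         # row i is r(x^{(i)}) = D D^T x^{(i)}
err = np.linalg.norm(X - recons, 'fro')

# The top-l eigenvector basis is optimal: the l *smallest* eigenvectors,
# for instance, give a strictly worse reconstruction on this data.
D_bad = eigvecs[:, :l]
err_bad = np.linalg.norm(X - (X @ D_bad) @ D_bad.T, 'fro')
assert err < err_bad
```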
Chapter 3
Probability and Information Theory

In this chapter, we describe probability theory and information theory.
Probability theory is a mathematical framework for representing uncertain statements. It provides a means of quantifying uncertainty and axioms for deriving new uncertain statements. In artificial intelligence applications, we use probability theory in two major ways. First, the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.

Probability theory is a fundamental tool of many disciplines of science and engineering. We provide this chapter to ensure that readers whose background is primarily in software engineering, with limited exposure to probability theory, can understand the material in this book.
While probability theory allows us to make uncertain statements and reason in the presence of uncertainty, information theory allows us to quantify the amount of uncertainty in a probability distribution.

If you are already familiar with probability theory and information theory, you may wish to skip all of this chapter except for Sec. 3.14, which describes the graphs we use to describe structured probabilistic models for machine learning. If you have absolutely no prior experience with these subjects, this chapter should be sufficient to successfully carry out deep learning research projects, but we do suggest that you consult an additional resource, such as Jaynes (2003).
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
3.1 Why Probability?
Many branches of computer science deal mostly with entities that are entirely deterministic and certain. A programmer can usually safely assume that a CPU will execute each machine instruction flawlessly. Errors in hardware do occur, but are rare enough that most software applications do not need to be designed to account for them. Given that many computer scientists and software engineers work in a relatively clean and certain environment, it can be surprising that machine learning makes heavy use of probability theory.

This is because machine learning must always deal with uncertain quantities, and sometimes may also need to deal with stochastic (non-deterministic) quantities. Uncertainty and stochasticity can arise from many sources. Researchers have made compelling arguments for quantifying uncertainty using probability since at least the 1980s.
Many of the arguments presented here are summarized from or inspired by Pearl (1988).

Nearly all activities require some ability to reason in the presence of uncertainty. In fact, beyond mathematical statements that are true by definition, it is difficult to think of any proposition that is absolutely true or any event that is absolutely guaranteed to occur.

There are three possible sources of uncertainty:

1. Inherent stochasticity in the system being modeled. For example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume that the cards are truly shuffled into a random order.
2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all of the variables that drive the behavior of the system. For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant's choice is deterministic, but from the contestant's point of view, the outcome is uncertain.

3. Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model's predictions. For example, suppose we build a robot that can exactly observe the location of every object around it. If the
robot discretizes space when predicting the future location of these objects, then the discretization makes the robot immediately become uncertain about the precise position of objects: each object could be anywhere within the discrete cell that it was observed to occupy.

In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule. For example, the simple rule "Most birds fly" is cheap to develop and is broadly useful, while a rule of the form, "Birds fly, except for very young birds that have not learned to fly, sick or injured birds that have lost the ability to fly, flightless species of birds including the cassowary, ostrich and kiwi. . .
" is expensive to develop, maintain and communicate, and after all of this effort is still very brittle and prone to failure.

Given that we need a means of representing and reasoning about uncertainty, it is not immediately obvious that probability theory can provide all of the tools we want for artificial intelligence applications. Probability theory was originally developed to analyze the frequencies of events. It is easy to see how probability theory can be used to study events like drawing a certain hand of cards in a game of poker. These kinds of events are often repeatable. When we say that an outcome has a probability p of occurring, it means that if we repeated the experiment (e.g., draw a hand of cards) infinitely many times, then proportion p of the repetitions would result in that outcome. This kind of reasoning does not seem immediately applicable to propositions that are not repeatable. If a doctor
kind of reasoning es not analyzes patientwould and sa says ys that has This a 40% chance of havingdothe flu, seem immediately applicable to prop ositions that are not rep eatable. If a do ctor this means something very different—w different—wee can not mak makee infinitely man many y replicas of analyzes a patient and sa ys that the patient has a 40% chance of having flu, the patient, nor is there an any y reason to believe that differen differentt replicas of the the patien patient t this means something very different—w e can not mak e infinitely man y replicas of would present with the same symptoms yet hav havee varying underlying conditions. In the patient, nor do is there an y reason to b elieve that t replicasto of represent the patienat the case of the diagnosing the patient, we differen use probability doctor ctor would the1 same symptoms yet hav e varying underlying conditions. In de degr gr greee present of belief elief,with , with indicating absolute certaint certainty y that the patient has the flu the case of the doabsolute ctor diagnosing patient, we usedo probability represent a and 0 indicating certain certainttythe that the patient does es not hav haveetothe flu. The de gr e e of b elief , with 1 indicating absolute certaint y that the patient has the flu former kind of probability probability,, related directly to the rates at which even events ts occur, is and 0 indicating absolute certain t y that the patient do es not hav e the The kno known wn as fr freequentist pr prob ob obability ability ability,, while the latter, related to qualitative flu. lev levels els of former of probability , related directly certain certainttkind y, is known as Bayesian pr prob ob obability ability ability.. to the rates at which events occur, is known as frequentist probability, while the latter, related to qualitative levels of If we list several properties that we expect common sense reasoning ab about out certainty, is known as Bayesian probability. 
certainty, is known as Bayesian probability.

If we list several properties that we expect common sense reasoning about uncertainty to have, then the only way to satisfy those properties is to treat Bayesian probabilities as behaving exactly the same as frequentist probabilities. For example, if we want to compute the probability that a player will win a poker game given that she has a certain set of cards, we use exactly the same formulas as when we compute the probability that a patient has a disease given that she
has certain symptoms. For more details about why a small set of common sense assumptions implies that the same axioms must control both kinds of probability, see Ramsey (1926).

Probability can be seen as the extension of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions are implied to be true or false given the assumption that some other set of propositions is true or false. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.
3.2 Random Variables
A random variable is a variable that can take on different values randomly. We typically denote the random variable itself with a lower case letter in plain typeface, and the values it can take on with lower case script letters. For example, x1 and x2 are both possible values that the random variable x can take on. For vector-valued variables, we would write the random variable as x and one of its values as x. On its own, a random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states are.

Random variables may be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value.
A continuous random variable is associated with a real value.
3.3 Probability Distributions
A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
3.3.1 Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables may be described using a probability mass function (PMF). We typically denote probability mass functions with a capital P. Often we associate each random variable with a different probability
mass function and the reader must infer which probability mass function to use based on the identity of the random variable, rather than the name of the function; P(x) is usually not the same as P(y).

The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state. The probability that x = x is denoted as P(x), with a probability of 1 indicating that x = x is certain and a probability of 0 indicating that x = x is impossible. Sometimes to disambiguate which PMF to use, we write the name of the random variable explicitly: P(x = x). Sometimes we define a variable first, then use ∼ notation to specify which distribution it follows later: x ∼ P(x).

Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability
distribution. P(x = x, y = y) denotes the probability that x = x and y = y simultaneously. We may also write P(x, y) for brevity.

To be a probability mass function on a random variable x, a function P must satisfy the following properties:

• The domain of P must be the set of all possible states of x.

• ∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0 and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.

• Σ_{x∈x} P(x) = 1. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.

For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x, that is, make each of its states equally
likely, by setting its probability mass function to

P(x = x_i) = 1/k        (3.1)

for all i. We can see that this fits the requirements for a probability mass function. The value 1/k is positive because k is a positive integer. We also see that

Σ_i P(x = x_i) = Σ_i 1/k = k/k = 1,        (3.2)

so the distribution is properly normalized.
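These two requirements can be checked mechanically. The sketch below (the `uniform_pmf` helper is a name invented for this example) builds the uniform PMF of Eq. 3.1 with exact rational arithmetic, so the normalization check of Eq. 3.2 holds exactly rather than only up to floating-point rounding:

```python
from fractions import Fraction

def uniform_pmf(k):
    """Uniform PMF over k named states (Eq. 3.1): each state gets 1/k."""
    return {f"x_{i}": Fraction(1, k) for i in range(1, k + 1)}

pmf = uniform_pmf(6)

# Every probability lies in [0, 1] ...
assert all(0 <= p <= 1 for p in pmf.values())
# ... and the probabilities sum to exactly 1 (Eq. 3.2), so the
# distribution is properly normalized.
assert sum(pmf.values()) == 1
```

Using `Fraction` rather than floats is just a convenience here; with floats the sum would only be approximately 1.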
3.3.2 Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function. To be a probability density function, a function p must satisfy the following properties:

• The domain of p must be the set of all possible states of x.

• ∀x ∈ x, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.

• ∫ p(x)dx = 1.

A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.

We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set.
In the univariate example, the probability that x lies in the interval [a, b] is given by ∫_{[a,b]} p(x)dx.

For an example of a probability density function corresponding to a specific probability density over a continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The ";" notation means "parametrized by"; we consider x to be the argument of the function, while a and b are parameters that define the function. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere. Additionally, it integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x ∼ U(a, b).
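A quick numerical check of these claims, assuming a hypothetical `uniform_density` helper and endpoints chosen arbitrarily for the example; the integral is approximated with a midpoint Riemann sum:

```python
def uniform_density(x, a, b):
    """u(x; a, b): equal to 1/(b - a) inside [a, b] and 0 outside."""
    assert b > a
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Midpoint Riemann sum over [a, b]; the endpoints and step count are
# arbitrary choices for this example.
a, b, n = 2.0, 5.0, 100_000
dx = (b - a) / n
total = sum(uniform_density(a + (i + 0.5) * dx, a, b) * dx for i in range(n))

assert uniform_density(1.0, a, b) == 0.0   # no probability mass outside [a, b]
assert abs(total - 1.0) < 1e-6             # the density integrates to 1
```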
3.4 Marginal Probability

Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution.

For example, suppose we have discrete random variables x and y, and we know P(x, y). We can find P(x) with the sum rule:
∀x ∈ x, P(x = x) = Σ_y P(x = x, y = y).        (3.3)
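When a joint distribution over discrete variables is stored as a table, the sum rule can be applied directly. The sketch below uses a small made-up joint P(x, y) over two binary variables; the `marginal_x` helper is hypothetical:

```python
# A made-up joint distribution P(x, y) over two binary variables,
# stored as a table keyed by (x, y) pairs.
joint = {
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

def marginal_x(joint):
    """Sum rule (Eq. 3.3): P(x = x) = sum over y of P(x = x, y = y)."""
    p = {}
    for (x, y), prob in joint.items():
        p[x] = p.get(x, 0.0) + prob
    return p

p_x = marginal_x(joint)
assert abs(p_x[0] - 0.3) < 1e-12   # 0.1 + 0.2
assert abs(p_x[1] - 0.7) < 1e-12   # 0.3 + 0.4
```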
The name "marginal probability" comes from the process of computing marginal probabilities on paper. When the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row.

For continuous variables, we need to use integration instead of summation:
p(x) = ∫ p(x, y)dy.        (3.4)

3.5 Conditional Probability
In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability. We denote the conditional probability that y = y given x = x as P(y = y | x = x). This conditional probability can be computed with the formula

P(y = y | x = x) = P(y = y, x = x) / P(x = x).        (3.5)

The conditional probability is only defined when P(x = x) > 0. We cannot compute the conditional probability conditioned on an event that never happens.

It is important not to confuse conditional probability with computing what would happen if some action were undertaken. The conditional probability that a person is from Germany given that they speak German is quite high, but if a randomly selected person is taught to speak German, their country of origin does not change. Computing the consequences of an action is called making an intervention query.
Intervention queries are the domain of causal modeling, which we do not explore in this book.
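Eq. 3.5 translates directly into code for discrete variables stored as a joint table. The joint distribution and the `conditional` helper below are made up for the example; note that the division is refused when P(x = x) = 0, matching the caveat above:

```python
# A made-up joint distribution P(x, y) over two binary variables.
joint = {
    (0, 0): 0.1, (0, 1): 0.2,
    (1, 0): 0.3, (1, 1): 0.4,
}

def conditional(joint, y, x):
    """Eq. 3.5: P(y = y | x = x) = P(y = y, x = x) / P(x = x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x == 0:
        # The conditional is undefined when P(x = x) = 0.
        raise ValueError("cannot condition on an event that never happens")
    return joint[(x, y)] / p_x

# P(y = 1 | x = 1) = 0.4 / (0.3 + 0.4)
assert abs(conditional(joint, y=1, x=1) - 0.4 / 0.7) < 1e-12
```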
3.6 The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:

P(x^(1), . . . , x^(n)) = P(x^(1)) Π_{i=2}^{n} P(x^(i) | x^(1), . . . , x^(i−1)).        (3.6)

This observation is known as the chain rule or product rule of probability. It follows immediately from the definition of conditional probability in Eq. 3.5. For
example, applying the definition twice, we get

P(a, b, c) = P(a | b, c)P(b, c)
P(b, c) = P(b | c)P(c)
P(a, b, c) = P(a | b, c)P(b | c)P(c).
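The factorization can be confirmed numerically on a small example. Below, a made-up joint P(a, b, c) over three binary variables is marginalized with a hypothetical `marg` helper, and the product P(a | b, c)P(b | c)P(c) is checked against P(a, b, c) for every state:

```python
import itertools

# A made-up joint distribution P(a, b, c) over three binary variables.
states = list(itertools.product([0, 1], repeat=3))
probs = [0.02, 0.08, 0.10, 0.05, 0.20, 0.15, 0.25, 0.15]   # sums to 1
joint = dict(zip(states, probs))

def marg(joint, keep):
    """Marginalize the joint down to the variables at the given index positions."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_bc = marg(joint, (1, 2))   # P(b, c)
p_c = marg(joint, (2,))      # P(c)

# Check Eq. 3.6 state by state: P(a,b,c) = P(a | b,c) P(b | c) P(c),
# where each conditional is formed by dividing, per Eq. 3.5.
for (a, b, c), p in joint.items():
    rhs = (joint[(a, b, c)] / p_bc[(b, c)]) * (p_bc[(b, c)] / p_c[(c,)]) * p_c[(c,)]
    assert abs(p - rhs) < 1e-12
```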
3.7 Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:

∀x ∈ x, y ∈ y, p(x = x, y = y) = p(x = x)p(y = y).        (3.7)

Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

∀x ∈ x, y ∈ y, z ∈ z, p(x = x, y = y | z = z) = p(x = x | z = z)p(y = y | z = z).        (3.8)

We can denote independence and conditional independence with compact notation: x⊥y means that x and y are independent, while x⊥y | z means that x and y are conditionally independent given z.
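For discrete variables with the joint stored as a table, Eq. 3.7 can be checked by comparing each joint probability against the product of the marginals. The `is_independent` helper and both example joints below are invented for illustration:

```python
def is_independent(joint, tol=1e-9):
    """Check Eq. 3.7: p(x, y) = p(x) p(y) for every pair of states."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return all(abs(p - p_x[x] * p_y[y]) < tol for (x, y), p in joint.items())

# Independent: the joint is the outer product of its two marginals.
indep = {(0, 0): 0.12, (0, 1): 0.18, (1, 0): 0.28, (1, 1): 0.42}
# Dependent: x and y always take the same value.
dep = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

assert is_independent(indep)
assert not is_independent(dep)
```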
3.8 Expectation, Variance and Covariance

The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation:

    E_{x∼P}[f(x)] = Σ_x P(x)f(x),    (3.9)

while for continuous variables, it is computed with an integral:

    E_{x∼p}[f(x)] = ∫ p(x)f(x)dx.    (3.10)
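Eq. 3.9 can be illustrated with a short numerical sketch (the distribution and the function f below are made up): the exact summation agrees with a Monte Carlo average over samples drawn from P.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up discrete distribution P over states {0, 1, 2} and a function f.
states = np.array([0, 1, 2])
P = np.array([0.2, 0.5, 0.3])
f = lambda v: v ** 2

# Exact expectation via the summation of Eq. 3.9:
# 0.2*0 + 0.5*1 + 0.3*4 = 1.7.
exact = np.sum(P * f(states))

# Monte Carlo estimate: draw x ~ P and average f(x).
samples = rng.choice(states, size=100_000, p=P)
estimate = f(samples).mean()
print(exact, estimate)
```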
When the identity of the distribution is clear from the context, we may simply write the name of the random variable that the expectation is over, as in E_x[f(x)]. If it is clear which random variable the expectation is over, we may omit the subscript entirely, as in E[f(x)]. By default, we can assume that E[·] averages over the values of all the random variables inside the brackets. Likewise, when there is no ambiguity, we may omit the square brackets.

Expectations are linear, for example,

    E_x[αf(x) + βg(x)] = αE_x[f(x)] + βE_x[g(x)],    (3.11)

when α and β are not dependent on x.

The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

    Var(f(x)) = E[(f(x) − E[f(x)])²].    (3.12)

When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation.

The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

    Cov(f(x), g(y)) = E[(f(x) − E[f(x)])(g(y) − E[g(y)])].    (3.13)

High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.

The notions of covariance and dependence are related, but are in fact distinct concepts. They are related because two variables that are independent have zero covariance, and two variables that have non-zero covariance are dependent. However, independence is a distinct property from covariance. For two variables to have zero covariance, there must be no linear dependence between them. Independence is a stronger requirement than zero covariance, because independence also excludes nonlinear relationships. It is possible for two variables to be dependent but have zero covariance. For example, suppose we first sample a real number x from a uniform distribution over the interval [−1, 1]. We next sample a random variable
s. With probability 1/2, we choose the value of s to be 1. Otherwise, we choose the value of s to be −1. We can then generate a random variable y by assigning y = sx. Clearly, x and y are not independent, because x completely determines the magnitude of y. However, Cov(x, y) = 0.

The covariance matrix of a random vector x ∈ R^n is an n × n matrix, such that

    Cov(x)_{i,j} = Cov(x_i, x_j).    (3.14)

The diagonal elements of the covariance give the variance:

    Cov(x_i, x_i) = Var(x_i).    (3.15)
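The dependent-but-uncorrelated construction above translates directly into code; a sketch with NumPy (the sample size is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000  # sample size chosen arbitrarily

# x ~ Uniform[-1, 1]; s = +1 or -1 with probability 1/2 each; y = s x.
# x determines the magnitude of y, yet the sign of y is random, so
# the sample covariance is close to zero even though y depends on x.
x = rng.uniform(-1.0, 1.0, size=n)
s = rng.choice([-1.0, 1.0], size=n)
y = s * x

cov = np.cov(x, y)[0, 1]
print(cov)
```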
3.9 Common Probability Distributions

Several simple probability distributions are useful in many contexts in machine learning.
3.9.1 Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the random variable being equal to 1. It has the following properties:

    P(x = 1) = φ    (3.16)
    P(x = 0) = 1 − φ    (3.17)
    P(x = x) = φ^x (1 − φ)^(1−x)    (3.18)
    E_x[x] = φ    (3.19)
    Var_x(x) = φ(1 − φ)    (3.20)
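As a quick sanity check of Eqs. 3.19 and 3.20 (the value φ = 0.3 is arbitrary), Bernoulli samples should have mean φ and variance φ(1 − φ):

```python
import numpy as np

rng = np.random.default_rng(0)

phi = 0.3  # arbitrary choice of the Bernoulli parameter

# Draw x in {0, 1} with P(x = 1) = phi, then compare the sample
# mean and variance with Eqs. 3.19 and 3.20.
x = rng.binomial(n=1, p=phi, size=200_000)
print(x.mean())   # should be close to phi
print(x.var())    # should be close to phi * (1 - phi) = 0.21
```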
3.9.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.¹

¹ “Multinoulli” is a term that was recently coined by Gustavo Lacerda and popularized by Murphy (2012). The multinoulli distribution is a special case of the multinomial distribution. A multinomial distribution is the distribution over vectors in {0, . . . , n}^k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution. Many texts use the term “multinomial” to refer to multinoulli distributions without clarifying that they refer only to the n = 1 case.
The multinoulli distribution is parametrized by a vector p ∈ [0, 1]^(k−1), where p_i gives the probability of the i-th state. The final, k-th state’s probability is given by 1 − 1⊤p. Note that we must constrain 1⊤p ≤ 1. Multinoulli distributions are often used to refer to distributions over categories of objects, so we do not usually assume that state 1 has numerical value 1, etc. For this reason, we do not usually need to compute the expectation or variance of multinoulli-distributed random variables.

The Bernoulli and multinoulli distributions are sufficient to describe any distribution over their domain. This is because they model discrete variables for which it is feasible to simply enumerate all of the states. When dealing with continuous variables, there are uncountably many states, so any distribution described by a small number of parameters must impose strict limits on the distribution.
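A sketch of the k − 1 parametrization described above (the probabilities are made up): the final state’s probability is recovered as 1 − 1⊤p, and sample frequencies match the full vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# k = 4 states, parametrized by the first k - 1 probabilities; the
# final state's probability is 1 - sum(p), as described in the text.
p = np.array([0.1, 0.2, 0.3])
full = np.append(p, 1.0 - p.sum())      # [0.1, 0.2, 0.3, 0.4]

# Draw multinoulli samples (one trial over k categories) and
# compare empirical frequencies with the parameter vector.
samples = rng.choice(len(full), size=100_000, p=full)
freqs = np.bincount(samples, minlength=len(full)) / len(samples)
print(freqs)
```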
3.9.3 Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

    N(x; µ, σ²) = √(1/(2πσ²)) exp(−(1/(2σ²))(x − µ)²).    (3.21)

See Fig. 3.1 for a plot of the density function.

The two parameters µ ∈ R and σ ∈ (0, ∞) control the normal distribution. The parameter µ gives the coordinate of the central peak. This is also the mean of the distribution: E[x] = µ. The standard deviation of the distribution is given by σ, and the variance by σ².

When we evaluate the PDF, we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parametrizing the distribution is to use a parameter β ∈ (0, ∞) to control the precision or inverse variance of the distribution:

    N(x; µ, β^(−1)) = √(β/(2π)) exp(−(1/2)β(x − µ)²).    (3.22)

Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.

First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in
Figure 3.1: The normal distribution: The normal distribution N(x; µ, σ²) exhibits a classic “bell curve” shape, with the x coordinate of its central peak given by µ, and the width of its peak controlled by σ. The maximum is at x = µ, and the inflection points are at x = µ ± σ. In this example, we depict the standard normal distribution, with µ = 0 and σ = 1.
practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.

Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model. Fully developing and justifying this idea requires more mathematical tools, and is postponed to Sec. 19.4.2.

The normal distribution generalizes to R^n, in which case it is known as the multivariate normal distribution. It may be parametrized with a positive definite symmetric matrix Σ:

    N(x; µ, Σ) = √(1/((2π)^n det(Σ))) exp(−(1/2)(x − µ)⊤Σ^(−1)(x − µ)).    (3.23)

The parameter µ still gives the mean of the distribution, though now it is vector-valued. The parameter Σ gives the covariance matrix of the distribution.

As in the univariate case, when we wish to evaluate the PDF several times for
many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert Σ to evaluate the PDF. We can instead use a precision matrix β:

    N(x; µ, β^(−1)) = √(det(β)/(2π)^n) exp(−(1/2)(x − µ)⊤β(x − µ)).    (3.24)

We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.
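A sketch of the precision parametrization of Eq. 3.24 (the function name and the 2-D parameter values are illustrative, not from the text): the matrix inverse is paid once, after which each evaluation needs only products.

```python
import numpy as np

def gaussian_pdf_precision(x, mu, beta):
    """Density of Eq. 3.24, parametrized by the precision matrix beta."""
    n = len(mu)
    d = x - mu
    norm = np.sqrt(np.linalg.det(beta) / (2 * np.pi) ** n)
    return norm * np.exp(-0.5 * d @ beta @ d)

# Illustrative 2-D parameters; the precision is the inverse covariance.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
beta = np.linalg.inv(Sigma)   # invert once, then evaluate cheaply

x = np.array([0.5, 0.5])
print(gaussian_pdf_precision(x, mu, beta))
```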
3.9.4 Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

    p(x; λ) = λ 1_{x≥0} exp(−λx).    (3.25)

The exponential distribution uses the indicator function 1_{x≥0} to assign probability zero to all negative values of x.

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the Laplace distribution

    Laplace(x; µ, γ) = (1/(2γ)) exp(−|x − µ|/γ).    (3.26)

3.9.5 The Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all of the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x):

    p(x) = δ(x − µ).    (3.27)

The Dirac delta function is defined such that it is zero-valued everywhere except 0, yet integrates to 1.
The Dirac delta function is not an ordinary function that associates each value x with a real-valued output; instead it is a different kind of mathematical object called a generalized function that is defined in terms of its properties when integrated. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less mass on all points other than µ.
By defining p(x) to be δ shifted by −µ we obtain an infinitely narrow and infinitely high peak of probability mass where x = µ.

A common use of the Dirac delta distribution is as a component of an empirical distribution,

    p̂(x) = (1/m) Σ_{i=1}^{m} δ(x − x^(i))    (3.28)

which puts probability mass 1/m on each of the m points x^(1), . . . , x^(m) forming a given data set or collection of samples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated to each possible input value that is simply equal to the empirical frequency of that value in the training set.

We can view the empirical distribution formed from a dataset of training examples as specifying the distribution that we sample from when we train a model on this dataset. Another important perspective on the empirical distribution is that it is the probability density that maximizes the likelihood of the training data (see Sec. 5.5).
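Sampling from the empirical distribution of Eq. 3.28 amounts to drawing training points uniformly at random; a minimal sketch with a made-up dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up training set of m = 4 real-valued points (0.2 repeats).
data = np.array([-1.5, 0.2, 0.2, 3.0])

# Sampling from the empirical distribution of Eq. 3.28 means picking
# one of the m training points uniformly at random.
samples = rng.choice(data, size=100_000)

# Each distinct value occurs with its empirical frequency in the data.
print(np.mean(samples == 0.2))   # should be close to 2/4 = 0.5
```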
3.9.6 Mixtures of Distributions

It is also common to define probability distributions by combining other simpler probability distributions. One common way of combining distributions is to construct a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution generates the sample is determined by sampling a component identity from a multinoulli distribution:

    P(x) = Σ_i P(c = i)P(x | c = i)    (3.29)

where P(c) is the multinoulli distribution over component identities.

We have already seen one example of a mixture distribution: the empirical distribution over real-valued variables is a mixture distribution with one Dirac component for each training example.

The mixture model is one simple strategy for combining probability distributions to create a richer distribution. In Chapter 16, we explore the art of building complex probability distributions from simple ones in more detail.
The mixture model allows us to briefly glimpse a concept that will be of paramount importance later: the latent variable. A latent variable is a random variable that we cannot observe directly. The component identity variable c of the mixture model provides an example. Latent variables may be related to x through the joint distribution, in this case, P(x, c) = P(x | c)P(c). The distribution P(c) over the latent variable and the distribution P(x | c) relating the latent variable to the visible variables determines the shape of the distribution P(x) even though it is possible to describe P(x) without reference to the latent variable. Latent variables are discussed further in Sec. 16.5.

A very powerful and common type of mixture model is the Gaussian mixture model, in which the components p(x | c = i) are Gaussians. Each component has a separately parametrized mean µ^(i) and covariance Σ^(i). Some mixtures can have more constraints. For example, the covariances could be shared across components via the constraint Σ^(i) = Σ ∀i. As with a single Gaussian distribution, the mixture of Gaussians might constrain the covariance matrix for each component to be diagonal or isotropic.

In addition to the means and covariances, the parameters of a Gaussian mixture specify the prior probability α_i = P(c = i) given to each component i. The word “prior” indicates that it expresses the model’s beliefs about c before it has observed x. By comparison, P(c | x) is a posterior probability, because it is computed after observation of x. A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific, non-zero amount of error by a Gaussian mixture model with enough components.

Fig. 3.2 shows samples from a Gaussian mixture model.
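Generating samples like those in Fig. 3.2 uses ancestral sampling: first draw the component identity c from the multinoulli prior α, then draw x from the chosen Gaussian. A 1-D sketch (all parameter values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D Gaussian mixture with three components.
alpha = np.array([0.3, 0.4, 0.3])   # prior P(c = i) over components
mu = np.array([-4.0, 0.0, 4.0])     # component means
sigma = np.array([1.0, 0.5, 1.0])   # component standard deviations

def sample_gmm(n):
    """Ancestral sampling: draw c ~ P(c), then x ~ p(x | c)."""
    c = rng.choice(len(alpha), size=n, p=alpha)
    return rng.normal(mu[c], sigma[c])

x = sample_gmm(200_000)
# The mixture mean is sum_i alpha_i mu_i = 0 for these parameters.
print(x.mean())
```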
3.10 Useful Properties of Common Functions

Certain functions arise often while working with probability distributions, especially the probability distributions used in deep learning models.

One of these functions is the logistic sigmoid:

    σ(x) = 1/(1 + exp(−x)).    (3.30)

The logistic sigmoid is commonly used to produce the φ parameter of a Bernoulli distribution because its range is (0, 1), which lies within the valid range of values for the φ parameter. See Fig. 3.3 for a graph of the sigmoid function. The sigmoid
Figure 3.2: Samples from a Gaussian mixture model. In this example, there are three components. From left to right, the first component has an isotropic covariance matrix, meaning it has the same amount of variance in each direction. The second has a diagonal covariance matrix, meaning it can control the variance separately along each axis-aligned direction. This example has more variance along the x_2 axis than along the x_1 axis. The third component has a full-rank covariance matrix, allowing it to control the variance separately along an arbitrary basis of directions.
function satur saturates ates when its argument is very positiv ositivee or very negative, meaning that the function becomes very flat and insensitiv insensitivee to small changes in its input. function saturates when its argument is very positive or very negative, meaning Another commonly encountered function is the softplus function (Dugas et al., that the function becomes very flat and insensitive to small changes in its input. 2001 2001): ): Another commonly encountered function is the function (Dugas(3.31) et al., ζ (x) = log (1 + exp( x))softplus . 2001): The softplus function can be useful producing the ζ (x) =for logpro (1 ducing + exp(x )) .β or σ parameter of a normal (3.31) distribution because its range is (0, ∞ ). It also arises commonly when manipulating β or σ The softplusin function be useful producing parameter of afrom normal expressions inv volving can sigmoids. Theforname of the the softplus function comes the , distribution b ecause its range is (0 ) . It also arises commonly when manipulating fact that it is a smo smoothed othed or “softened” version of expressions involving sigmoids. The ∞name of the softplus function comes from the x+ = max(0 , x)of . (3.32) fact that it is a smoothed or “softened” version See Fig. 3.4 for a graph of the softplus function. x = max(0 , x). (3.32) following properties erties aresoftplus all useful enough that you may wish to memorize See The Fig. follo 3.4 wing for a prop graph of the function. them: The following properties are all useful enough that you may wish to memorize them: exp(x) σ(x) = (3.33) exp(x) + exp(0) exp(x) σd(x) = (3.33) exp( ) +− exp(0) σ(x) = σ(x)(1 σ(x)) (3.34) dx d σ(x) = σ68 (x)(1 σ(x)) (3.34) dx −
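The identities in Eqs. 3.33 and 3.34 are easy to confirm numerically. The sketch below (illustrative, not from the text) checks Eq. 3.33 directly and Eq. 3.34 against a finite-difference derivative:

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))    # logistic sigmoid, Eq. 3.30

def zeta(x):
    return np.log1p(np.exp(x))         # softplus, Eq. 3.31

x = np.linspace(-5, 5, 101)

# Eq. 3.33: sigma(x) = exp(x) / (exp(x) + exp(0))
assert np.allclose(sigma(x), np.exp(x) / (np.exp(x) + np.exp(0)))

# Eq. 3.34: d/dx sigma(x) = sigma(x)(1 - sigma(x)),
# checked against a central finite difference.
h = 1e-6
num_grad = (sigma(x + h) - sigma(x - h)) / (2 * h)
assert np.allclose(num_grad, sigma(x) * (1 - sigma(x)), atol=1e-6)

# Softplus lies above the positive part x+ = max(0, x) of Eq. 3.32.
assert np.all(zeta(x) >= np.maximum(0, x))
print("all checks passed")
```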
Figure 3.3: The logistic sigmoid function.
Figure 3.4: The softplus function.
1 − σ(x) = σ(−x)    (3.35)

log σ(x) = −ζ(−x)    (3.36)

d/dx ζ(x) = σ(x)    (3.37)

∀x ∈ (0, 1), σ⁻¹(x) = log ( x / (1 − x) )    (3.38)

∀x > 0, ζ⁻¹(x) = log (exp(x) − 1)    (3.39)

ζ(x) = ∫_{−∞}^{x} σ(y) dy    (3.40)

ζ(x) − ζ(−x) = x    (3.41)
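Several of these identities can be spot-checked numerically; the sketch below (assuming nothing beyond NumPy) verifies Eqs. 3.35, 3.36, 3.38, 3.39 and 3.41 on a grid:

```python
import numpy as np

def sigma(x): return 1.0 / (1.0 + np.exp(-x))   # logistic sigmoid
def zeta(x):  return np.log1p(np.exp(x))        # softplus

x = np.linspace(-4, 4, 81)
assert np.allclose(1 - sigma(x), sigma(-x))            # Eq. 3.35
assert np.allclose(np.log(sigma(x)), -zeta(-x))        # Eq. 3.36

p = np.linspace(0.01, 0.99, 50)
assert np.allclose(sigma(np.log(p / (1 - p))), p)      # Eq. 3.38: logit inverts sigma

y = np.linspace(0.1, 4, 40)
assert np.allclose(zeta(np.log(np.expm1(y))), y)       # Eq. 3.39: softplus inverse

assert np.allclose(zeta(x) - zeta(-x), x)              # Eq. 3.41
print("identities hold numerically")
```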
The function σ⁻¹(x) is called the logit in statistics, but this term is more rarely used in machine learning.

Eq. 3.41 provides extra justification for the name "softplus." The softplus function is intended as a smoothed version of the positive part function, x⁺ = max{0, x}. The positive part function is the counterpart of the negative part function, x⁻ = max{0, −x}. To obtain a smooth function that is analogous to the negative part, one can use ζ(−x). Just as x can be recovered from its positive and negative parts via the identity x⁺ − x⁻ = x, it is also possible to recover x using the same relationship between ζ(x) and ζ(−x), as shown in Eq. 3.41.

3.11 Bayes' Rule

We often find ourselves in a situation where we know P(y | x) and need to know P(x | y). Fortunately, if we also know P(x), we can compute the desired quantity using Bayes' rule:

P(x | y) = P(x) P(y | x) / P(y).    (3.42)

Note that while P(y) appears in the formula, it is usually feasible to compute P(y) = Σ_x P(y | x) P(x), so we do not need to begin with knowledge of P(y).

Bayes' rule is straightforward to derive from the definition of conditional probability, but it is useful to know the name of this formula since many texts refer to it by name. It is named after the Reverend Thomas Bayes, who first discovered a special case of the formula. The general version presented here was independently discovered by Pierre-Simon Laplace.
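A minimal worked example of Eq. 3.42 with made-up numbers, computing P(y) by marginalization as described above:

```python
# Suppose x ∈ {0, 1} with prior P(x), and a likelihood P(y = 1 | x).
# All numbers here are hypothetical, chosen only to illustrate the arithmetic.
P_x = {0: 0.8, 1: 0.2}
P_y1_given_x = {0: 0.1, 1: 0.9}

# P(y = 1) = sum_x P(y = 1 | x) P(x): no separate knowledge of P(y) needed.
P_y1 = sum(P_y1_given_x[x] * P_x[x] for x in P_x)   # 0.08 + 0.18 = 0.26

# Posterior P(x = 1 | y = 1) via Bayes' rule, Eq. 3.42.
posterior = P_x[1] * P_y1_given_x[1] / P_y1
print(posterior)  # 0.18 / 0.26 ≈ 0.692
```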
3.12 Technical Details of Continuous Variables

A proper formal understanding of continuous random variables and probability density functions requires developing probability theory in terms of a branch of mathematics known as measure theory. Measure theory is beyond the scope of this textbook, but we can briefly sketch some of the issues that measure theory is employed to resolve.

In Sec. 3.3.2, we saw that the probability of a continuous vector-valued x lying in some set S is given by the integral of p(x) over the set S. Some choices of set S can produce paradoxes. For example, it is possible to construct two sets S_1 and S_2 such that p(x ∈ S_1) + p(x ∈ S_2) > 1 but S_1 ∩ S_2 = ∅. These sets are generally constructed making very heavy use of the infinite precision of real numbers, for example by making fractal-shaped sets or sets that are defined by transforming the set of rational numbers.²
One of the key contributions of measure theory is to provide a characterization of the set of sets that we can compute the probability of without encountering paradoxes. In this book, we only integrate over sets with relatively simple descriptions, so this aspect of measure theory never becomes a relevant concern.

For our purposes, measure theory is more useful for describing theorems that apply to most points in ℝ^n but do not apply to some corner cases. Measure theory provides a rigorous way of describing that a set of points is negligibly small. Such a set is said to have "measure zero." We do not formally define this concept in this textbook. However, it is useful to understand the intuition that a set of measure zero occupies no volume in the space we are measuring. For example, within ℝ², a line has measure zero, while a filled polygon has positive measure. Likewise, an individual point has measure zero.
Any union of countably many sets that each have measure zero also has measure zero (so the set of all the rational numbers has measure zero, for instance).

Another useful term from measure theory is "almost everywhere." A property that holds almost everywhere holds throughout all of space except for on a set of measure zero. Because the exceptions occupy a negligible amount of space, they can be safely ignored for many applications. Some important results in probability theory hold for all discrete values but only hold "almost everywhere" for continuous values.

Another technical detail of continuous variables relates to handling continuous random variables that are deterministic functions of one another. Suppose we have two random variables, x and y, such that y = g(x), where g is an invertible, continuous, differentiable transformation. One might expect that p_y(y) = p_x(g⁻¹(y)). This is actually not the case.

As a simple example, suppose we have scalar random variables x and y. Suppose y = x/2 and x ∼ U(0, 1). If we use the rule p_y(y) = p_x(2y) then p_y will be 0 everywhere except the interval [0, 1/2], and it will be 1 on this interval. This means

∫ p_y(y) dy = 1/2,    (3.43)

which violates the definition of a probability distribution.

This common mistake is wrong because it fails to account for the distortion of space introduced by the function g. Recall that the probability of x lying in an infinitesimally small region with volume δx is given by p(x)δx. Since g can expand or contract space, the infinitesimal volume surrounding x in x space may have different volume in y space.

To see how to correct the problem, we return to the scalar case. We need to preserve the property

|p_y(g(x)) dy| = |p_x(x) dx|.    (3.44)

Solving from this, we obtain

p_y(y) = p_x(g⁻¹(y)) |∂x/∂y|    (3.45)

or equivalently

p_x(x) = p_y(g(x)) |∂g(x)/∂x|.    (3.46)

In higher dimensions, the derivative generalizes to the determinant of the Jacobian matrix, the matrix with J_{i,j} = ∂x_i/∂y_j. Thus, for real-valued vectors x and y,

p_x(x) = p_y(g(x)) |det (∂g(x)/∂x)|.    (3.47)

² The Banach-Tarski theorem provides a fun example of such sets.
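The U(0, 1) example above can be reproduced numerically. The sketch below (with hypothetical helper names) integrates both the naive rule and the corrected density of Eq. 3.45 on a grid:

```python
import numpy as np

# y = x/2 with x ~ U(0, 1), so y lives on [0, 1/2] and |dx/dy| = 2.
y = np.linspace(0.0, 0.5, 100001)[1:-1]   # interior grid points of [0, 1/2]
dy = 0.5 / 100000

p_x = lambda x: np.where((x >= 0) & (x <= 1), 1.0, 0.0)  # U(0, 1) density
g_inv = lambda y: 2.0 * y                                # x = g^{-1}(y) = 2y

naive = p_x(g_inv(y))              # forgets the volume distortion of g
correct = p_x(g_inv(y)) * 2.0      # multiplies by |dx/dy| = 2, Eq. 3.45

print(round(float((naive * dy).sum()), 3))    # ≈ 0.5: violates normalization, Eq. 3.43
print(round(float((correct * dy).sum()), 3))  # ≈ 1.0: a proper density
```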
3.13 Information Theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. It was originally invented to study sending messages from discrete alphabets over a noisy channel, such as communication via radio transmission. In this context, information theory tells how to design optimal codes and calculate the expected length of messages sampled from specific probability distributions using various encoding schemes. In the context of machine learning, we can also apply information theory to continuous variables where some of these message length interpretations do not apply. This field is fundamental to many areas of electrical engineering and computer science. In this textbook, we mostly use a few key ideas from information theory to characterize probability distributions or quantify similarity between probability distributions. For more detail on information theory, see Cover and Thomas (2006) or MacKay (2003).

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying "the sun rose this morning" is so uninformative as to be unnecessary to send, but a message saying "there was a solar eclipse this morning" is very informative.

We would like to quantify information in a way that formalizes this intuition. Specifically,

• Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.

• Less likely events should have higher information content.

• Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.

In order to satisfy all three of these properties, we define the self-information of an event x = x to be

I(x) = − log P(x).    (3.48)

In this book, we always use log to mean the natural logarithm, with base e. Our definition of I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.

When x is continuous, we use the same definition of information by analogy, but some of the properties from the discrete case are lost. For example, an event with unit density still has zero information, despite not being an event that is guaranteed to occur.
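The three desired properties are visible directly in the definition of Eq. 3.48; a small sketch in nats (illustrative only):

```python
import math

# Self-information I(x) = -log P(x), Eq. 3.48, in nats (natural log).
def self_information(p):
    return -math.log(p)

# A guaranteed event carries no information.
print(self_information(1.0) == 0.0)   # True

# Independent events add: two heads from a fair coin carry twice
# the information of one head, since P = 0.5 * 0.5.
one_head = self_information(0.5)
two_heads = self_information(0.5 * 0.5)
print(two_heads / one_head)           # ≈ 2.0

# One nat: the information in observing an event of probability 1/e.
print(self_information(1 / math.e))   # ≈ 1.0
```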
Figure 3.5: This plot shows how distributions that are closer to deterministic have low Shannon entropy while distributions that are close to uniform have high Shannon entropy. On the horizontal axis, we plot p, the probability of a binary random variable being equal to 1. The entropy is given by (p − 1) log(1 − p) − p log p. When p is near 0, the distribution is nearly deterministic, because the random variable is nearly always 0. When p is near 1, the distribution is nearly deterministic, because the random variable is nearly always 1. When p = 0.5, the entropy is maximal, because the distribution is uniform over the two outcomes.
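The entropy formula in the caption of Fig. 3.5 can be checked numerically; a small sketch (not from the text) confirming the maximum at p = 0.5:

```python
import numpy as np

# Binary Shannon entropy in nats: (p - 1) log(1 - p) - p log p.
def binary_entropy(p):
    return (p - 1) * np.log(1 - p) - p * np.log(p)

p = np.linspace(0.01, 0.99, 99)
H = binary_entropy(p)
print(p[np.argmax(H)])             # ≈ 0.5: entropy peaks at the uniform distribution
print(float(binary_entropy(0.5)))  # log(2) ≈ 0.693 nats
```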
Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

H(x) = E_{x∼P}[I(x)] = −E_{x∼P}[log P(x)],    (3.49)

also denoted H(P). In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits (if the logarithm is base 2, otherwise the units are different) needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. See Fig. 3.5 for a demonstration. When x is continuous, the Shannon entropy is known as the differential entropy.

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

D_KL(P‖Q) = E_{x∼P}[log (P(x)/Q(x))] = E_{x∼P}[log P(x) − log Q(x)].    (3.50)
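For discrete distributions the expectation in Eq. 3.50 is a finite sum, so the KL divergence and its properties are easy to check; the two distributions below are made up for illustration:

```python
import numpy as np

# Discrete KL divergence, Eq. 3.50, in nats.
def kl(P, Q):
    return float(np.sum(P * (np.log(P) - np.log(Q))))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.3, 0.4, 0.3])

print(kl(P, P))                        # 0.0: zero iff the distributions are equal
print(kl(P, Q) >= 0)                   # True: non-negativity
print(np.isclose(kl(P, Q), kl(Q, P)))  # False: asymmetric in general
```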
In the case of discrete variables, it is the extra amount of information (measured in bits if we use the base 2 logarithm, but in machine learning we usually use nats and the natural logarithm) needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

The KL divergence has many useful properties, most notably that it is non-negative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal "almost everywhere" in the case of continuous variables. Because the KL divergence is non-negative and measures the difference between two distributions, it is often conceptualized as measuring some sort of distance between these distributions. However, it is not a true distance measure because it is not symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P) for some P and Q. This asymmetry means that there are important consequences to the choice of whether to use D_KL(P‖Q) or D_KL(Q‖P). See Fig. 3.6 for more detail.

A quantity that is closely related to the KL divergence is the cross-entropy H(P, Q) = H(P) + D_KL(P‖Q), which is similar to the KL divergence but lacking the term on the left:

H(P, Q) = −E_{x∼P} log Q(x).    (3.51)

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.

When computing many of these quantities, it is common to encounter expressions of the form 0 log 0. By convention, in the context of information theory, we treat these expressions as lim_{x→0} x log x = 0.
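The identity H(P, Q) = H(P) + D_KL(P‖Q) and the 0 log 0 = 0 convention can both be exercised in a short sketch (the helper `xlogy` is a hypothetical name introduced here):

```python
import numpy as np

def xlogy(x, y):
    # Treat 0 log 0 as 0, per the information-theory convention.
    return np.where(x == 0, 0.0, x * np.log(np.where(x == 0, 1.0, y)))

def entropy(P):
    return float(-np.sum(xlogy(P, P)))          # Eq. 3.49

def kl(P, Q):
    return float(np.sum(xlogy(P, P) - xlogy(P, Q)))  # Eq. 3.50

def cross_entropy(P, Q):
    return float(-np.sum(xlogy(P, Q)))          # Eq. 3.51

P = np.array([0.5, 0.5, 0.0])   # includes a zero-probability outcome
Q = np.array([0.25, 0.5, 0.25])

# Identity: H(P, Q) = H(P) + D_KL(P||Q)
print(np.isclose(cross_entropy(P, Q), entropy(P) + kl(P, Q)))  # True
```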
3.14 Structured Probabilistic Models

Machine learning algorithms often involve probability distributions over a very large number of random variables. Often, these probability distributions involve direct interactions between relatively few variables. Using a single function to describe the entire joint probability distribution can be very inefficient (both computationally and statistically).

Instead of using a single function to represent a probability distribution, we can split a probability distribution into many factors that we multiply together. For example, suppose we have three random variables: a, b and c. Suppose that a influences the value of b and b influences the value of c, but that a and c are independent given b. We can represent the probability distribution over all three
Figure 3.6: The KL divergence is asymmetric. Suppose we ha hav ve a distribution p(x) and wish to approximate it with another distribution q ( x). We hav havee the choice of minimizing p(x) and Figure D 3.6: divergence iseasymmetric. havcehoice a distribution either or D illustrate theSuppose effect ofwethis using a mixture of pkq) KL KL (The KL ( q kp). W ) q ( x wish to approximate it with another distribution . W e hav e the c hoice of minimizing two Gaussians for p, and a single Gaussian for q . The choice of whic which h direction of the D (p qto ( q p). We illustrate ) or either the effect of this crequire hoice using a mixture of KL divergence useDis problem-dep problem-dependen enden endent. t. Some applications an approximation two Gaussians for p,high and aprobabilit for qthat . Thethe choice which direction the k single Gaussian that usually kplaces probability y anywhere true ofdistribution placesofhigh KL divergence to use is problem-dep enden t. Some applications require an approximation probabilit probability y, while other applications require an appro approximation ximation that rarely places high that usually places y anywhereplaces that the distribution placesofhigh probabilit probability y an anywhere ywherehigh that probabilit the true distribution low true probabilit probability y. The choice the probabilitofy, the while applications requireofan appro ximation that rarely places direction KLother div divergence ergence reflects which these considerations takes priorit priority y for high eac each h probabilit y an ywhere that the true distribution places low probabilit y . The choice of the application. (L (Left) eft) The effect of minimizing DKL ( pkq). In this case, we select a q that has direction of theyKL divergence reflects which yof. these priorit y for p has high p has multipletakes q cho high probabilit probability where probabilit probability Whenconsiderations mo modes, des, hooses oseseac toh (Left) ( p q). 
yInmass q that The application. The effect of minimizing this on case, select a(Right) has blur the modes together, in order to put highDprobabilit probability allwofe them. p (has high probabilit y where . When multiple moprobability des, q cho oses to k pa has DKL q that effect of minimizing . In probabilit this case, ywe select has low where q kp)high (Right) together, in order tomput highmo probabilit mass on all of The hasthe lo low wmodes probabilit probability y. When ultiple modes des thaty are sufficien sufficiently tlythem. widely separated, pblur p has q that ahas ). In this q pergence effect of minimizing case, we select low mo probability where as in this figure, theDKL (div divergence is minimized by cahoosing single mode, de, in order to p p has lo w probabilit y . When has m ultiple mo des that are sufficien tly widely separated, k avoid putting probability mass in the low-probabilit low-probability y areas b et etw ween mo modes des of p. Here, we as in this the figure, the KL divergence is minimized by choosing a singleWmo de, inalso order to q is chosen illustrate outcome when to emphasize the left mode. e could hav have e avhiev oid putting probability low-probabilit y areas bthe etwright een mo des ofIfpthe . Here, we ac achiev hieved ed an equal value ofmass the in KLthe div divergence ergence by choosing mode. mo modes des illustrate the outcome when qnis emphasize the left mode. We could alsoofhav are not separated by a sufficie sufficien tlychosen strongto lo low w probabilit probability y region, then this direction thee ac hiev ed an equal v alue of the KL div ergence b y c hoosing the right mode. If the mo des KL div divergence ergence can still choose to blur the mo modes. des. are not separated by a sufficiently strong low probability region, then this direction of the KL divergence can still choose to blur the mo des.
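The asymmetry is easy to verify numerically for discrete distributions. Below is a minimal sketch; the vectors p and q are arbitrary illustrative distributions (a bimodal p and a q concentrated on one mode), not the Gaussian mixtures shown in the figure:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A "bimodal" p, and a q that puts most of its mass on only one of p's modes.
p = np.array([0.45, 0.05, 0.05, 0.45])
q = np.array([0.80, 0.10, 0.05, 0.05])

print(kl(p, q))  # D_KL(p || q)
print(kl(q, p))  # D_KL(q || p): a different value, since KL is asymmetric
```

Both values are non-negative and zero only when the two distributions are equal, but in general kl(p, q) != kl(q, p), which is why the two minimization objectives in the figure behave so differently.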
CHAPTER 3. PROBABILITY AND INFORMATION THEORY
variables as a product of probability distributions over two variables:

p(a, b, c) = p(a)p(b | a)p(c | b).    (3.52)
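This factorization can be checked numerically. The sketch below verifies the general chain rule, p(a, b, c) = p(a)p(b | a)p(c | a, b), for an arbitrary joint over three binary variables; the simpler factor p(c | b) appearing in Eq. 3.52 additionally assumes that c is conditionally independent of a given b:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary joint distribution p(a, b, c) over three binary variables.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

p_a = joint.sum(axis=(1, 2))                    # p(a)
p_b_given_a = joint.sum(axis=2) / p_a[:, None]  # p(b | a)
# For a generic joint we need the full conditional p(c | a, b); the form
# p(c | b) in Eq. 3.52 holds only under the conditional independence above.
p_c_given_ab = joint / joint.sum(axis=2, keepdims=True)

reconstructed = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_ab
print(np.allclose(reconstructed, joint))  # True: the chain rule is exact
```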
These factorizations can greatly reduce the number of parameters needed to describe the distribution. Each factor uses a number of parameters that is exponential in the number of variables in the factor. This means that we can greatly reduce the cost of representing a distribution if we are able to find a factorization into distributions over fewer variables.

We can describe these kinds of factorizations using graphs. Here we use the word "graph" in the sense of graph theory: a set of vertices that may be connected to each other with edges. When we represent the factorization of a probability distribution with a graph, we call it a structured probabilistic model or graphical model.

There are two main kinds of structured probabilistic models: directed and undirected. Both kinds of graphical models use a graph G in which each node in the graph corresponds to a random variable, and an edge connecting two
random variables means that the probability distribution is able to represent direct interactions between those two random variables.

Directed models use graphs with directed edges, and they represent factorizations into conditional probability distributions, as in the example above. Specifically, a directed model contains one factor for every random variable x_i in the distribution, and that factor consists of the conditional distribution over x_i given the parents of x_i, denoted Pa_G(x_i):

p(x) = ∏_i p(x_i | Pa_G(x_i)).    (3.53)

See Fig. 3.7 for an example of a directed graph and the factorization of probability distributions it represents.

Undirected models use graphs with undirected edges, and they represent factorizations into a set of functions; unlike in the directed case, these functions are usually not probability distributions of any kind.
edges, An Any y set of they no nodes desrepresen that are all torizations to into a set of functions; unlik in the Each directed case, functions are ) in an connected each other in G is called a eclique. clique undirected C (ithese usually probability kind. factors Any setare of just nodes that arenot all ) C (iany mo model del isnot asso associated ciated withdistributions a factor φ(i) (of ). These functions, connectedytodistributions. each other inThe is output called aofclique. Each clique in an undirected probabilit probability each factor must be non-negative, but φ ( mo del is asso ciated with a factor ) . These factors are just functions, not G C there is no constraint that the factor must sum or integrate to 1 like a probability probability distributions. The outputCof each factor must be non-negative, but distribution. there is no constraint that the factor must sum or integrate to 1 like a probability The probability of a configuration of random variables is pr prop op oportional ortional to the distribution. pro product duct of all of these factors—assignments that result in larger factor values are The probability of a configuration of random variables is proportional to the 77 product of all of these factors—assignments that result in larger factor values are
Figure 3.7: A directed graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

p(a, b, c, d, e) = p(a)p(b | a)p(c | a, b)p(d | b)p(e | c).    (3.54)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
more likely. Of course, there is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant Z, defined to be the sum or integral over all states of the product of the φ functions, in order to obtain a normalized probability distribution:

p(x) = (1/Z) ∏_i φ^(i)(C^(i)).    (3.55)

See Fig. 3.8 for an example of an undirected graph and the factorization of probability distributions it represents.

Keep in mind that these graphical representations of factorizations are a language for describing probability distributions. They are not mutually exclusive families of probability distributions. Being directed or undirected is not a property of a probability distribution; it is a property of a particular description of a probability distribution, but any probability distribution may be described in both ways.
Throughout Part I and Part II of this book, we will use structured probabilistic models merely as a language to describe which direct probabilistic relationships different machine learning algorithms choose to represent. No further understanding of structured probabilistic models is needed until the discussion of research topics, in Part III, where we will explore structured probabilistic models in much greater detail.
Figure 3.8: An undirected graphical model over random variables a, b, c, d and e. This graph corresponds to probability distributions that can be factored as

p(a, b, c, d, e) = (1/Z) φ^(1)(a, b, c) φ^(2)(b, d) φ^(3)(c, e).    (3.56)

This graph allows us to quickly see some properties of the distribution. For example, a and c interact directly, but a and e interact only indirectly via c.
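For small discrete models, the normalizing constant Z can be computed by brute-force enumeration of all joint states. A minimal sketch using the factor structure of Eq. 3.56, with arbitrary non-negative factor values over binary variables:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Non-negative factors mirroring Eq. 3.56: phi1(a, b, c), phi2(b, d),
# phi3(c, e). The factor values here are arbitrary illustrative numbers.
phi1 = rng.random((2, 2, 2))
phi2 = rng.random((2, 2))
phi3 = rng.random((2, 2))

def unnormalized(a, b, c, d, e):
    return phi1[a, b, c] * phi2[b, d] * phi3[c, e]

# Z is the sum over all joint states of the product of the factors.
states = list(itertools.product([0, 1], repeat=5))
Z = sum(unnormalized(*s) for s in states)

p = {s: unnormalized(*s) / Z for s in states}
print(sum(p.values()))  # 1.0 (up to floating-point error): p is normalized
```

This enumeration has cost exponential in the number of variables, which is why computing Z exactly is intractable for large models.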
This chapter has reviewed the basic concepts of probability theory that are most relevant to deep learning. One more set of fundamental mathematical tools remains: numerical methods.
Chapter 4
Numerical Computation

Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula providing a symbolic expression for the correct solution. Common operations include optimization (finding the value of an argument that minimizes or maximizes a function) and solving systems of linear equations. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers, which cannot be represented precisely using a finite amount of memory.
4.1 Overflow and Underflow

The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.

One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. For example, we usually want to avoid division by zero (some software
CHAPTER 4. NUMERICAL COMPUTATION
environments will raise exceptions when this occurs, others will return a result with a placeholder not-a-number value) or taking the logarithm of zero (this is usually treated as −∞, which then becomes not-a-number if it is used for many further arithmetic operations).

Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as ∞ or −∞. Further arithmetic will usually change these infinite values into not-a-number values.

One example of a function that must be stabilized against underflow and overflow is the softmax function. The softmax function is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined to be

softmax(x)_i = exp(x_i) / Σ_{j=1}^n exp(x_j).    (4.1)

Consider what happens when all of the x_i are equal to some constant c. Analytically, we can see that all of the outputs should be equal to 1/n. Numerically, this may
Numerically Numerically,, this may x Consider what happ ens when all of the are equal to some constan t c. Analytically exp exp((c) will, not occur when c has large magnitude. If c is very negativ negative, e, then w e can see that means all of the should be equal to will . Numerically , this underflo underflow. w. This the outputs denominator of the softmax become 0, so the may final c c exp ( c not occur when has large magnitude. If is v ery negativ e, then ) will exp((c) will ov result is undefined. When c is very large P and positiv ositive, e, exp overflo erflo erflow, w, again underflow.inThis means the denominator of theundefined. softmax will become 0, so the final resulting the expression as a whole being Both of these difficulties c exp ( c result is undefined. When is very large and p ositiv e, ) will ov erflo w, again softmax((z ) where z = x − max i xi . Simple can be resolved by instead ev evaluating aluating softmax resultingshows in the expression whole beingfunction undefined. of these difficulties algebra that the valueasofathe softmax is notBoth changed analytically by softmax z ) where z = x max can be resolved by instead evaluating . Simple maxx x adding or subtracting a scalar from the input (vector. Subtracting results i i algebra shows argument that the value of bthe softmax function is not by − analytically in the largest to exp eing 0, whic which h rules out thechanged possibility of ov overflo erflo erflow. w. max x adding or subtracting a scalar from the input vector. Subtracting results Lik Likewise, ewise, at least one term in the denominator has a value of 1, which rules out in the largest argument to exp being 0, which rules out the ossibilityby of zero. overflow. the possibility of underflow in the denominator leading to apdivision Likewise, at least one term in the denominator has a value of 1, which rules out There is still one small problem. 
Underflow in the numerator can still cause the possibility of underflow in the denominator leading to a division by zero. the expression as a whole to ev evaluate aluate to zero. This means that if we implement There is still one small problem. Underflow in the numerator can cause log softmax softmax((x) by first running the softmax subroutine then passing thestill result to the expression as waewhole evaluate to zero. −∞ This means that if we implement implement the log function, could to erroneously obtain . Instead, we must softmaxfunction (x) by first running the softmax subroutine then passing thewa result to alogseparate that calculates in a numerically stable way y. The log softmax the softmax log function, we could obtain . Instead, log function can beerroneously stabilized using the same trick aswe we must used implement to stabilize a separate function that calculates in a n umerically stable way. The log softmax −∞ the softmax function. log softmax function can be stabilized using the same trick as we used to stabilize For the most part, we do not explicitly detail all of the numerical considerations the softmax function. in inv volv olved ed in implementing the various algorithms describ described ed in this book. Developers F or the most part, we do not explicitly detail all of the numerical of low-lev low-level el libraries should keep numerical issues in mind when considerations implementing in v olv ed in implementing the v arious algorithms describ ed in this b o ok. Developers deep learning algorithms. Most readers of this book can simply rely on lowof low-lev el libraries should k eep numerical issues in mind when implementing lev level el libraries that provide stable implementations. In some cases, it is possible deep learning aalgorithms. Most of this ook can simplyautomatically rely on lowto implement new algorithm andreaders hav havee the new b implementation level libraries that provide stable implementations. 
In some cases, it is possible to implement a new algorithm and hav81 e the new implementation automatically
stabilized. Theano (Bergstra et al., 2010; Bastien et al., 2012) is an example of a software package that automatically detects and stabilizes many common numerically unstable expressions that arise in the context of deep learning.
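The stabilization trick described above can be sketched in a few lines; this is a minimal illustration, not the implementation used by any particular library:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: shift by max(x) before exponentiating."""
    z = x - np.max(x)  # largest argument to exp is now 0, ruling out overflow
    e = np.exp(z)
    return e / np.sum(e)

def log_softmax(x):
    """Stable log softmax: never exponentiate and then take the log."""
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))

x = np.array([1e4, 1e4, 1e4])  # naive exp(1e4) would overflow to inf
print(softmax(x))              # each output is 1/3, as the analysis predicts
print(log_softmax(np.array([-1e4, 0.0])))  # finite values, no erroneous -inf
```

Note that log_softmax subtracts the max inside the log-sum-exp rather than calling softmax and then log, which is exactly the failure mode described in the text.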
4.2 Poor Conditioning

Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.

Consider the function f(x) = A⁻¹x. When A ∈ ℝ^(n×n) has an eigenvalue decomposition, its condition number is

max_{i,j} |λ_i / λ_j|.    (4.2)

This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input.

This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
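A small numerical sketch of Eq. 4.2: a diagonal matrix with widely separated eigenvalues has a large condition number, and a tiny perturbation of the input to f(x) = A⁻¹x is amplified accordingly. The specific matrix and perturbation here are arbitrary illustrative choices:

```python
import numpy as np

# A poorly conditioned matrix: its eigenvalues differ by 8 orders of magnitude.
A = np.diag([1.0, 1e-8])

eigvals = np.linalg.eigvals(A)
cond = np.max(np.abs(eigvals)) / np.min(np.abs(eigvals))
print(cond)  # ~1e8, the ratio in Eq. 4.2

# A tiny change in the input produces a large change in A^{-1} x.
x = np.array([1.0, 1.0])
dx = np.array([0.0, 1e-6])  # small input error
y = np.linalg.solve(A, x)
y_perturbed = np.linalg.solve(A, x + dx)
print(np.linalg.norm(y_perturbed - y))  # ~1e2: the 1e-6 error, amplified
```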
4.3 Gradient-Based Optimization

Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function f(x) by altering x. We usually phrase most optimization problems in terms of minimizing f(x). Maximization may be accomplished via a minimization algorithm by minimizing −f(x).

The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.

We often denote the value that minimizes or maximizes a function with a superscript *. For example, we might say x* = arg min f(x).
[Figure 4.1 plots f(x) = ½x² and f′(x) = x. Annotations: the global minimum is at x = 0, where f′(x) = 0, so gradient descent halts there; for x < 0 we have f′(x) < 0, so we can decrease f by moving rightward; for x > 0 we have f′(x) > 0, so we can decrease f by moving leftward.]
Figure 4.1: An illustration of how the derivatives of a function can be used to follow the function downhill to a minimum. This technique is called gradient descent.
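The procedure shown in Fig. 4.1 can be sketched in a few lines; the learning rate and step count here are arbitrary illustrative choices:

```python
# Gradient descent on f(x) = 0.5 * x**2, whose derivative is f'(x) = x,
# the same function shown in Fig. 4.1.
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)  # step against the derivative's sign
    return x

x_min = gradient_descent(grad=lambda x: x, x0=2.0)
print(x_min)  # very close to 0, the global minimum
```

With this choice of learning rate, each step multiplies x by 0.9, so the iterate converges geometrically toward the minimum at x = 0.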
We assume the reader is already familiar with calculus, but provide a brief review of how calculus concepts relate to optimization here.

Suppose we have a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x) gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + εf′(x).

The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847). See Fig. 4.1 for an example of this technique.

When f′(x) = 0, the derivative provides no information about which direction to move. Points where f′(x) = 0 are known as critical points or stationary points. A local minimum is a point where f(x) is lower than at all neighboring points, so it is no longer possible to decrease f(x) by making infinitesimal steps. A local maximum is a point where f(x) is higher than at all neighboring points, so it is
Figure 4.2: Examples of each of the three types of critical points in 1-D. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points, a local maximum, which is higher than the neighboring points, or a saddle point, which has neighbors that are both higher and lower than the point itself.
not possible to increase f(x) by making infinitesimal steps. Some critical points are neither maxima nor minima. These are known as saddle points. See Fig. 4.2 for examples of each type of critical point.

A point that obtains the absolute lowest value of f(x) is a global minimum. It is possible for there to be only one global minimum or multiple global minima of the function. It is also possible for there to be local minima that are not globally optimal. In the context of deep learning, we optimize functions that may have many local minima that are not optimal, and many saddle points surrounded by very flat regions. All of this makes optimization very difficult, especially when the input to the function is multidimensional. We therefore usually settle for finding a value of f that is very low, but not necessarily minimal in any formal sense. See Fig. 4.3 for an example.

We often minimize functions that have multiple inputs: f : ℝⁿ → ℝ. For the
concept of "minimization" to make sense, there must still be only one (scalar) output.

For functions with multiple inputs, we must make use of the concept of partial derivatives. The partial derivative $\frac{\partial}{\partial x_i} f(x)$ measures how f changes as only the variable $x_i$ increases at point x. The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector: the gradient of f is the vector containing all of the partial derivatives, denoted $\nabla_x f(x)$. Element i of the gradient is the partial derivative of f with respect to $x_i$. In multiple dimensions,
CHAPTER 4. NUMERICAL COMPUTATION
[Figure 4.3 plots f(x) against x under the title "Approximate minimization," with annotations: "Ideally, we would like to arrive at the global minimum, but this might not be possible"; "This local minimum performs nearly as well as the global one, so it is an acceptable halting point"; "This local minimum performs poorly, and should be avoided."]
Figure 4.3: Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
critical points are points where every element of the gradient is equal to zero.

The directional derivative in direction u (a unit vector) is the slope of the function f in direction u. In other words, the directional derivative is the derivative of the function $f(x + \alpha u)$ with respect to α, evaluated at α = 0. Using the chain rule, we can see that $\frac{\partial}{\partial \alpha} f(x + \alpha u)$ evaluates to $u^\top \nabla_x f(x)$ when α = 0.

To minimize f, we would like to find the direction in which f decreases the fastest. We can do this using the directional derivative:

$$\min_{u,\, u^\top u = 1} u^\top \nabla_x f(x) \tag{4.3}$$
$$= \min_{u,\, u^\top u = 1} \|u\|_2 \, \|\nabla_x f(x)\|_2 \cos\theta \tag{4.4}$$

where θ is the angle between u and the gradient. Substituting in $\|u\|_2 = 1$ and ignoring factors that do not depend on u, this simplifies to $\min_u \cos\theta$. This is minimized when u points in the opposite direction as the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease f by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.

Steepest descent proposes a new point
$$x' = x - \epsilon \nabla_x f(x) \tag{4.5}$$
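Eq. 4.5 can be sketched in code. This is a minimal illustration, not an algorithm given in the text: the quadratic objective, learning rate, and stopping threshold below are all arbitrary choices.

```python
import numpy as np

# A minimal sketch of gradient descent (Eq. 4.5) on an illustrative
# quadratic f(x) = 1/2 x^T A x, whose gradient is A x. The matrix A,
# learning rate, and stopping threshold are arbitrary choices.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

epsilon = 0.1                       # learning rate
x = np.array([3.0, 3.0])
for _ in range(200):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-8:    # every element of the gradient near zero
        break
    x = x - epsilon * g             # x' = x - epsilon * grad f(x)  (Eq. 4.5)
```

Each step moves x opposite the gradient, and the loop halts once every element of the gradient is numerically close to zero, matching the convergence criterion of steepest descent.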
CHAPTER 4. NUMERICAL COMPUTATION
where ε is the learning rate, a positive scalar determining the size of the step. We can choose ε in several different ways. A popular approach is to set ε to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate $f(x - \epsilon \nabla_x f(x))$ for several values of ε and choose the one that results in the smallest objective function value. This last strategy is called a line search.

Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running this iterative algorithm, and just jump directly to the critical point by solving the equation $\nabla_x f(x) = 0$ for x.

Although gradient descent is limited to optimization in continuous spaces, the general concept of making small moves (that are approximately the best small move) towards better configurations can be generalized to discrete spaces.
Ascending an objective function of discrete parameters is called hill climbing (Russell and Norvig, 2003).
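The line search strategy described above can be sketched as follows; the candidate step sizes, helper name, and toy objective are illustrative choices, not values from the text.

```python
import numpy as np

# A sketch of a simple line search: evaluate f(x - eps * grad f(x)) for
# several candidate step sizes eps and keep the point with the smallest
# objective value. The candidate grid and the objective are illustrative.
def line_search_step(f, grad_f, x, candidates=(0.01, 0.1, 0.5, 1.0)):
    g = grad_f(x)
    trials = [x - eps * g for eps in candidates]
    return min(trials, key=f)       # smallest objective function value wins

f = lambda x: np.sum(x ** 2)        # toy objective
grad_f = lambda x: 2.0 * x
x = np.array([2.0, -1.0])
x = line_search_step(f, grad_f, x)
```

For this toy objective, the candidate ε = 0.5 happens to land exactly on the minimum; in general the line search only picks the best of the tried step sizes.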
4.3.1 Beyond the Gradient: Jacobian and Hessian Matrices
Sometimes we need to find all of the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as a Jacobian matrix. Specifically, if we have a function $f : \mathbb{R}^m \to \mathbb{R}^n$, then the Jacobian matrix $J \in \mathbb{R}^{n \times m}$ of f is defined such that $J_{i,j} = \frac{\partial}{\partial x_j} f(x)_i$.

We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, for a function $f : \mathbb{R}^n \to \mathbb{R}$, the derivative with respect to $x_i$ of the derivative of f with respect to $x_j$ is denoted as $\frac{\partial^2}{\partial x_i \partial x_j} f$. In a single dimension, we can denote $\frac{d^2}{dx^2} f$ by $f''(x)$. The second derivative tells us how the first derivative will change as we vary the input. This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. We can think of the second
derivative as measuring curvature. Suppose we have a quadratic function (many functions that arise in practice are not quadratic but can be approximated well as quadratic, at least locally). If such a function has a second derivative of zero, then there is no curvature. It is a perfectly flat line, and its value can be predicted using only the gradient. If the gradient is 1, then we can make a step of size ε along the negative gradient, and the cost function will decrease by ε. If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than ε. Finally, if the second derivative is positive, the function curves upward, so the cost function can decrease by less than ε. See Fig.
[Figure 4.4 shows three panels plotting f(x) against x, titled "Negative curvature," "No curvature," and "Positive curvature."]
Figure 4.4: The second derivative determines the curvature of a function. Here we show quadratic functions with various curvature. The dashed line indicates the value of the cost function we would expect based on the gradient information alone as we make a gradient step downhill. In the case of negative curvature, the cost function actually decreases faster than the gradient predicts. In the case of no curvature, the gradient predicts the decrease correctly. In the case of positive curvature, the function decreases slower than expected and eventually begins to increase, so too large of step sizes can actually increase the function inadvertently.
4.4 to see how different forms of curvature affect the relationship between the value of the cost function predicted by the gradient and the true value.

When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix $H(f)(x)$ is defined such that

$$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(x). \tag{4.6}$$

Equivalently, the Hessian is the Jacobian of the gradient.

Anywhere that the second partial derivatives are continuous, the differential operators are commutative, i.e. their order can be swapped:

$$\frac{\partial^2}{\partial x_i \partial x_j} f(x) = \frac{\partial^2}{\partial x_j \partial x_i} f(x). \tag{4.7}$$

This implies that $H_{i,j} = H_{j,i}$, so the Hessian matrix is symmetric at such points. Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of
eigenvectors. The second derivative in a specific direction represented by a unit vector d is given by $d^\top H d$. When d is an eigenvector of H, the second derivative in that direction is given by the corresponding eigenvalue. For other directions of d, the directional second derivative is a weighted average of all of the eigenvalues, with weights between 0 and 1, and eigenvectors that have smaller angle with d receiving more weight. The maximum eigenvalue determines the maximum second derivative and the minimum eigenvalue determines the minimum second derivative.

The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function f(x) around the current point $x^{(0)}$:

$$f(x) \approx f(x^{(0)}) + (x - x^{(0)})^\top g + \frac{1}{2}(x - x^{(0)})^\top H (x - x^{(0)}). \tag{4.8}$$
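The claims above about $d^\top H d$ can be checked numerically. This sketch uses an arbitrary symmetric matrix as the Hessian; `numpy.linalg.eigh` returns its real eigenvalues in ascending order together with orthonormal eigenvectors.

```python
import numpy as np

# Sketch: the second derivative in unit direction d is d^T H d. Along an
# eigenvector it equals the corresponding eigenvalue; along any other
# unit direction it lies between the smallest and largest eigenvalues.
# The symmetric matrix H here is an arbitrary example.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(H)    # ascending eigenvalues, orthonormal eigenvectors

d = eigvecs[:, 0]                       # unit eigenvector for the smallest eigenvalue
second_deriv = d @ H @ d                # equals eigvals[0]

rng = np.random.default_rng(0)
d = rng.normal(size=2)
d /= np.linalg.norm(d)                  # arbitrary unit direction
val = d @ H @ d                         # weighted average of the eigenvalues
```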
where g is the gradient and H is the Hessian at $x^{(0)}$. If we use a learning rate of ε, then the new point x will be given by $x^{(0)} - \epsilon g$. Substituting this into our approximation, we obtain

$$f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^\top g + \frac{1}{2}\epsilon^2 g^\top H g. \tag{4.9}$$

There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When $g^\top H g$ is zero or negative, the Taylor series approximation predicts that increasing ε forever will decrease f forever. In practice, the Taylor series is unlikely to remain accurate for large ε, so one must resort to more heuristic choices of ε in this case. When $g^\top H g$ is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields

$$\epsilon^* = \frac{g^\top g}{g^\top H g}. \tag{4.10}$$
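For a purely quadratic objective, the second-order Taylor series in Eq. 4.9 is exact, so the step size of Eq. 4.10 is exactly optimal. A sketch, with an illustrative H and starting point:

```python
import numpy as np

# Sketch of Eq. 4.10 on a quadratic f(x) = 1/2 x^T H x, whose gradient
# at x0 is H @ x0. H and x0 are illustrative choices. For a quadratic,
# eps_star minimizes f(x0 - eps * g) exactly over the step size eps.
H = np.array([[4.0, 0.0],
              [0.0, 1.0]])

def f(x):
    return 0.5 * x @ H @ x

x0 = np.array([1.0, 2.0])
g = H @ x0                          # gradient at x0
eps_star = (g @ g) / (g @ H @ g)    # optimal step size (Eq. 4.10)
```

No fixed step size beats `eps_star` here: moving to `x0 - eps_star * g` gives at least as large a decrease as any other step along the negative gradient.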
In the worst case, when g aligns with the eigenvector of H corresponding to the maximal eigenvalue $\lambda_{\max}$, then this optimal step size is given by $\frac{1}{\lambda_{\max}}$. To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.

The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or saddle point. Recall that on a critical point, $f'(x) = 0$. When $f''(x) > 0$, this means that $f'(x)$ increases as we move to the right, and $f'(x)$ decreases as we move to the left. This means $f'(x - \epsilon) < 0$ and
$f'(x + \epsilon) > 0$ for small enough ε. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when $f'(x) = 0$ and $f''(x) > 0$, we can conclude that x is a local minimum. Similarly, when $f'(x) = 0$ and $f''(x) < 0$, we can conclude that x is a local maximum. This is known as the second derivative test. Unfortunately, when $f''(x) = 0$, the test is inconclusive. In this case x may be a saddle point, or a part of a flat region.

In multiple dimensions, we need to examine all of the second derivatives of the function. Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where $\nabla_x f(x) = 0$, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, local minimum, or saddle point. When the
Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum. In multiple dimensions, it is actually possible to find positive evidence of saddle points in some cases. When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that x is a local maximum on one cross section of f but a local minimum on another cross section. See Fig. 4.5 for an example. Finally, the multidimensional second derivative test can be inconclusive, just like the univariate version. The test is
inconclusive whenever all of the non-zero eigenvalues have the same sign, but at least one eigenvalue is zero. This is because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue.

In multiple dimensions, there can be a wide variety of different second derivatives at a single point, because there is a different second derivative for each direction. The condition number of the Hessian measures how much the second derivatives vary. When the Hessian has a poor condition number, gradient descent performs poorly. This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly. Gradient descent is unaware of this change in the derivative so it does not know that it needs to explore preferentially in
the direction where the derivative remains negative for longer. It also makes it difficult to choose a good step size. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature. This usually means that the step size is too small to make significant progress in other directions with less curvature. See Fig. 4.6 for an example.

This issue can be resolved by using information from the Hessian matrix to
[Figure 4.5 shows a 3-D surface plot of f(x1, x2) over x1 and x2 ranging from −15 to 15, with function values from −500 to 500.]
Figure 4.5: A saddle point containing both positive and negative curvature. The function in this example is $f(x) = x_1^2 - x_2^2$. Along the axis corresponding to $x_1$, the function curves upward. This axis is an eigenvector of the Hessian and has a positive eigenvalue. Along the axis corresponding to $x_2$, the function curves downward. This direction is an eigenvector of the Hessian with negative eigenvalue. The name "saddle point" derives from the saddle-like shape of this function. This is the quintessential example of a function with a saddle point. In more than one dimension, it is not necessary to have an eigenvalue of 0 in order to get a saddle point: it is only necessary to have both positive and negative eigenvalues. We can think of a saddle point with both signs of eigenvalues as being a local maximum within one cross section and a local minimum within another cross section.
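The eigenvalue-based second derivative test described above can be sketched numerically; applied to the Hessian of the saddle function in Fig. 4.5, it reports a saddle point. The helper name and tolerance are illustrative choices.

```python
import numpy as np

# A sketch of the multidimensional second derivative test: classify a
# critical point by the signs of the Hessian's eigenvalues. The helper
# name and tolerance are illustrative choices.
def classify_critical_point(H, tol=1e-10):
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > tol):
        return "local minimum"      # positive definite Hessian
    if np.all(eigvals < -tol):
        return "local maximum"      # negative definite Hessian
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"       # mixed-sign eigenvalues
    return "inconclusive"           # some eigenvalues numerically zero

# Hessian of f(x) = x1^2 - x2^2 at its critical point, the origin.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
print(classify_critical_point(H))   # prints "saddle point"
```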
[Figure 4.6 shows a contour plot over x1 and x2, with axes ranging from roughly −30 to 20 and the path of gradient descent overlaid.]
Figure 4.6: Gradient descent fails to exploit the curvature information contained in the Hessian matrix. Here we use gradient descent to minimize a quadratic function f(x) whose Hessian matrix has condition number 5. This means that the direction of most curvature has five times more curvature than the direction of least curvature. In this case, the most curvature is in the direction $[1, 1]^\top$ and the least curvature is in the direction $[1, -1]^\top$. The red lines indicate the path followed by gradient descent. This very elongated quadratic function resembles a long canyon. Gradient descent wastes time repeatedly descending canyon walls, because they are the steepest feature. Because the step size is somewhat too large, it has a tendency to overshoot the bottom of the function and thus needs to descend the opposite canyon wall on the next iteration. The large positive eigenvalue of the Hessian corresponding to the eigenvector pointed in this direction indicates that this directional derivative is rapidly increasing, so an optimization algorithm based on the Hessian could predict that the steepest direction is not actually a promising search direction in this context.
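The condition number in the caption can be checked numerically. The exact quadratic used for Fig. 4.6 is not given in the text, so the matrix below is an assumed stand-in constructed to have eigenvectors along [1, 1] and [1, −1] with a 5:1 eigenvalue ratio.

```python
import numpy as np

# A sketch of computing a Hessian's condition number. This matrix is an
# assumed stand-in for the Fig. 4.6 quadratic: its eigenvectors lie
# along [1, 1] and [1, -1] with eigenvalues 5 and 1, i.e. condition
# number 5.
H = np.array([[3.0, 2.0],
              [2.0, 3.0]])
eigvals = np.linalg.eigvalsh(H)               # ascending: [1, 5]
condition_number = eigvals[-1] / eigvals[0]   # ratio of extreme curvatures
print(condition_number)                       # approximately 5
```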
guide the search. The simplest method for doing so is known as Newton's method. Newton's method is based on using a second-order Taylor series expansion to approximate f(x) near some point x^(0):

f(x) ≈ f(x^(0)) + (x − x^(0))^⊤ ∇_x f(x^(0)) + ½ (x − x^(0))^⊤ H(f)(x^(0)) (x − x^(0)). (4.11)

If we then solve for the critical point of this function, we obtain:

x* = x^(0) − H(f)(x^(0))^(−1) ∇_x f(x^(0)). (4.12)

When f is a positive definite quadratic function, Newton's method consists of applying Eq. 4.12 once to jump to the minimum of the function directly. When f is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton's method consists of applying Eq. 4.12 multiple times. Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point.
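As an illustration of Eq. 4.12, the following sketch (assuming NumPy; the matrix A and vector b are arbitrary illustrative choices, not from the text) applies a single Newton step to a positive definite quadratic and lands exactly on its minimum:

```python
import numpy as np

# Hypothetical example: minimize f(x) = 1/2 x^T A x - b^T x, whose
# Hessian is the constant matrix A. Newton's update (Eq. 4.12),
#   x* = x0 - H^{-1} grad f(x0),
# reaches the minimum of a quadratic in a single step.

A = np.array([[5.0, 0.0],
              [0.0, 1.0]])   # positive definite Hessian, condition number 5
b = np.array([1.0, 2.0])

def grad_f(x):
    return A @ x - b

x0 = np.array([10.0, -3.0])                     # arbitrary starting point
x_star = x0 - np.linalg.solve(A, grad_f(x0))    # one Newton step

# The critical point satisfies grad f(x*) = 0, i.e. A x* = b.
assert np.allclose(grad_f(x_star), 0.0)
```

Gradient descent on the same function would need many iterations whose number grows with the condition number; the Newton step is condition-number-independent here because it rescales by the inverse Hessian.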
As discussed in Sec. 8.2.3, Newton's method is only appropriate when the nearby critical point is a minimum (all the eigenvalues of the Hessian are positive), whereas gradient descent is not attracted to saddle points unless the gradient points toward them.

Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that also use the Hessian matrix, such as Newton's method, are called second-order optimization algorithms (Nocedal and Wright, 2006).

The optimization algorithms employed in most contexts in this book are applicable to a wide variety of functions, but come with almost no guarantees. This is because the family of functions used in deep learning is quite complicated. In many other fields, the dominant approach to optimization is to design optimization algorithms for a limited family of functions.
In the context of deep learning, we sometimes gain some guarantees by restricting ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous derivatives. A Lipschitz continuous function is a function f whose rate of change is bounded by a Lipschitz constant L:

∀x, ∀y, |f(x) − f(y)| ≤ L ||x − y||₂. (4.13)

This property is useful because it allows us to quantify our assumption that a small change in the input made by an algorithm such as gradient descent will have a small change in the output.
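Equation 4.13 can be spot-checked numerically. A minimal sketch, assuming NumPy and using f(x) = sin(x), which is Lipschitz continuous with L = 1 because |f'(x)| = |cos(x)| ≤ 1 everywhere:

```python
import numpy as np

# Spot-check the Lipschitz bound of Eq. 4.13 on random pairs of points.
# This is an illustration, not a proof: f(x) = sin(x), L = 1.

rng = np.random.default_rng(0)
L = 1.0
xs = rng.uniform(-10.0, 10.0, size=1000)
ys = rng.uniform(-10.0, 10.0, size=1000)

# |sin(x) - sin(y)| <= L |x - y| for every sampled pair
# (small slack for floating-point rounding).
assert np.all(np.abs(np.sin(xs) - np.sin(ys)) <= L * np.abs(xs - ys) + 1e-12)
```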
Lipschitz continuity is also a fairly weak constraint, and many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications.

Perhaps the most successful field of specialized optimization is convex optimization. Convex optimization algorithms are able to provide many more guarantees by making stronger restrictions. Convex optimization algorithms are applicable only to convex functions—functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima. However, most problems in deep learning are difficult to express in terms of convex optimization. Convex optimization is used only as a subroutine of some deep learning algorithms. Ideas from the analysis of convex optimization algorithms can be useful for proving the convergence of deep learning algorithms. However, in general, the importance of convex optimization is greatly diminished in the context of deep learning. For more information about convex optimization, see Boyd and Vandenberghe (2004) or Rockafellar (1997).
4.4
Constrained Optimization
Sometimes we wish not only to maximize or minimize a function f(x) over all possible values of x. Instead we may wish to find the maximal or minimal value of f(x) for values of x in some set S. This is known as constrained optimization. Points x that lie within the set S are called feasible points in constrained optimization terminology.

We often wish to find a solution that is small in some sense. A common approach in such situations is to impose a norm constraint, such as ||x|| ≤ 1.

One simple approach to constrained optimization is simply to modify gradient descent taking the constraint into account. If we use a small constant step size ε, we can make gradient descent steps, then project the result back into S. If we use a line search, we can search only over step sizes that yield new x points that are feasible, or we can project each point on the line back into the constraint region.
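A minimal sketch of the projection idea, assuming NumPy and taking S to be the unit L² ball, so that projection is just rescaling; the objective here is a hypothetical quadratic, not one from the text:

```python
import numpy as np

# Projected gradient descent on f(x) = ||x - c||^2 subject to ||x||_2 <= 1.
# The unconstrained minimum c lies outside the ball, so the constrained
# solution sits on the boundary, at c rescaled to unit norm.

c = np.array([3.0, 0.0])

def grad(x):
    return 2.0 * (x - c)

def project(x):
    # Projection onto the unit L2 ball: rescale only if outside.
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

x = np.zeros(2)
step = 0.1
for _ in range(100):
    x = project(x - step * grad(x))   # gradient step, then project into S

# Converges to the boundary point [1, 0].
assert np.allclose(x, [1.0, 0.0], atol=1e-6)
```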
When possible, this method can be made more efficient by projecting the gradient into the tangent space of the feasible region before taking the step or beginning the line search (Rosen, 1960).

A more sophisticated approach is to design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained optimization problem. For example, if we want to minimize f(x) for x ∈ ℝ² with x constrained to have exactly unit L² norm, we can instead minimize
g(θ) = f([cos θ, sin θ]^⊤) with respect to θ, then return [cos θ, sin θ] as the solution to the original problem. This approach requires creativity; the transformation between optimization problems must be designed specifically for each case we encounter.

The Karush–Kuhn–Tucker (KKT) approach¹ provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.

To define the Lagrangian, we first need to describe S in terms of equations and inequalities. We want a description of S in terms of m functions g^(i) and n functions h^(j) so that S = {x | ∀i, g^(i)(x) = 0 and ∀j, h^(j)(x) ≤ 0}. The equations involving g are called the equality constraints and the inequalities involving h are called inequality constraints.
We introduce new variables λ_i and α_j for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as

L(x, λ, α) = f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x). (4.14)

We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. Observe that, so long as at least one feasible point exists and f(x) is not permitted to have value ∞, then

min_x max_λ max_{α, α≥0} L(x, λ, α) (4.15)

has the same optimal objective function value and set of optimal points x as

min_{x∈S} f(x). (4.16)

This follows because any time the constraints are satisfied,

max_λ max_{α, α≥0} L(x, λ, α) = f(x), (4.17)

while any time a constraint is violated,

max_λ max_{α, α≥0} L(x, λ, α) = ∞. (4.18)

These properties guarantee that no infeasible point will ever be optimal, and that the optimum within the feasible points is unchanged.

¹The KKT approach generalizes the method of Lagrange multipliers, which allows equality constraints but not inequality constraints.
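The feasibility argument of Eqs. 4.17 and 4.18 can be illustrated numerically. A sketch assuming NumPy, with a toy one-dimensional problem whose objective and constraint are illustrative assumptions:

```python
import numpy as np

# Toy problem: one inequality constraint h(x) = x^2 - 1 <= 0,
# i.e. the feasible set S = [-1, 1], with objective f(x) = x.
# The generalized Lagrangian is L(x, alpha) = f(x) + alpha * h(x),
# with alpha >= 0 (no equality constraints here).

def f(x):
    return x

def h(x):
    return x**2 - 1.0

def max_over_alpha(x, alphas):
    # Approximate max_{alpha >= 0} L(x, alpha) over a finite grid.
    return max(f(x) + a * h(x) for a in alphas)

alphas = np.linspace(0.0, 1e6, 1001)

# Feasible point (h(x) <= 0): the max is attained at alpha = 0,
# recovering f(x) as in Eq. 4.17.
assert np.isclose(max_over_alpha(0.5, alphas), f(0.5))

# Infeasible point (h(x) > 0): L grows without bound in alpha (Eq. 4.18);
# on a finite grid it is merely very large.
assert max_over_alpha(2.0, alphas) > 1e5
```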
To perform constrained maximization, we can construct the generalized Lagrange function of −f(x), which leads to this optimization problem:

min_x max_λ max_{α, α≥0} −f(x) + Σ_i λ_i g^(i)(x) + Σ_j α_j h^(j)(x). (4.19)

We may also convert this to a problem with maximization in the outer loop:

max_x min_λ min_{α, α≥0} f(x) + Σ_i λ_i g^(i)(x) − Σ_j α_j h^(j)(x). (4.20)

The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each λ_i.

The inequality constraints are particularly interesting. We say that a constraint h^(i)(x) is active if h^(i)(x*) = 0. If a constraint is not active, then the solution to the problem found using that constraint would remain at least a local solution if that constraint were removed. It is possible that an inactive constraint excludes other solutions. For example, a convex problem with an entire region of globally optimal points (a wide, flat region of equal cost) could have a subset of this region eliminated by constraints, or a non-convex problem could have better local stationary points excluded by a constraint that is inactive at convergence. However, the point found at convergence remains a stationary point whether or not the inactive constraints are included. Because an inactive h^(i) has negative value, then the solution to min_x max_λ max_{α, α≥0} L(x, λ, α) will have α_i = 0. We can thus observe that at the solution, α ⊙ h(x) = 0. In other words, for all i, we know that at least one of the constraints α_i ≥ 0 and h^(i)(x) ≤ 0 must be active at the solution.

To gain some intuition for this idea, we can say that either the solution is on the boundary imposed by the inequality and we must use its KKT multiplier to influence the solution to x, or the inequality has no influence on the solution and we represent this by zeroing out its KKT multiplier.

The properties that the gradient of the generalized Lagrangian is zero, all constraints on both x and the KKT multipliers are satisfied, and α ⊙ h(x) = 0 are called the Karush–Kuhn–Tucker (KKT) conditions (Karush, 1939; Kuhn and Tucker, 1951). Together, these properties describe the optimal points of constrained optimization problems.

For more information about the KKT approach, see Nocedal and Wright (2006).
4.5
Example: Linear Least Squares
Suppose we want to find the value of x that minimizes

f(x) = ½ ||Ax − b||₂². (4.21)

There are specialized linear algebra algorithms that can solve this problem efficiently. However, we can also explore how to solve it using gradient-based optimization as a simple example of how these techniques work.

First, we need to obtain the gradient:

∇_x f(x) = A^⊤(Ax − b) = A^⊤Ax − A^⊤b. (4.22)

We can then follow this gradient downhill, taking small steps. See Algorithm 4.1 for details.

Algorithm 4.1 An algorithm to minimize f(x) = ½||Ax − b||₂² with respect to x using gradient descent.

  Set the step size (ε) and tolerance (δ) to small, positive numbers.
  while ||A^⊤Ax − A^⊤b||₂ > δ do
    x ← x − ε(A^⊤Ax − A^⊤b)
  end while

One can also solve this problem using Newton's method. In this case, because the true function is quadratic, the quadratic approximation employed by Newton's method is exact, and the algorithm converges to the global minimum in a single step.

Now suppose we wish to minimize the same function, but subject to the constraint x^⊤x ≤ 1. To do so, we introduce the Lagrangian

L(x, λ) = f(x) + λ(x^⊤x − 1). (4.23)

We can now solve the problem

min_x max_{λ, λ≥0} L(x, λ). (4.24)
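Algorithm 4.1 translates almost line for line into code. A sketch assuming NumPy; the particular A and b are arbitrary illustrations:

```python
import numpy as np

# A direct rendering of Algorithm 4.1: gradient descent on
# f(x) = 1/2 ||Ax - b||_2^2, using the gradient A^T A x - A^T b (Eq. 4.22).

def least_squares_gd(A, b, eps=0.01, delta=1e-8):
    x = np.zeros(A.shape[1])
    while np.linalg.norm(A.T @ A @ x - A.T @ b) > delta:
        x = x - eps * (A.T @ A @ x - A.T @ b)
    return x

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
b = np.array([1.0, 2.0, 2.0])

x = least_squares_gd(A, b)
# Agrees with the pseudoinverse solution x = A^+ b.
assert np.allclose(x, np.linalg.pinv(A) @ b, atol=1e-4)
```

Note that ε must be small enough relative to the largest eigenvalue of A^⊤A for the loop to converge; the fixed value above is a hypothetical choice that works for this particular A.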
The smallest-norm solution to the unconstrained least squares problem may be found using the Moore–Penrose pseudoinverse: x = A⁺b. If this point is feasible, then it is the solution to the constrained problem. Otherwise, we must find a
solution where the constraint is active. By differentiating the Lagrangian with respect to x, we obtain the equation

A^⊤Ax − A^⊤b + 2λx = 0. (4.25)

This tells us that the solution will take the form

x = (A^⊤A + 2λI)^(−1) A^⊤b. (4.26)
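Equation 4.26 gives x in closed form for any fixed λ; λ itself can then be adjusted by gradient ascent on the Lagrangian, whose derivative with respect to λ is x^⊤x − 1. A sketch assuming NumPy (the problem data and ascent rate are illustrative assumptions):

```python
import numpy as np

# Constrained least squares subject to x^T x <= 1, in the case where the
# unconstrained minimum is infeasible so the constraint is active.
# Alternate between the closed-form solve of Eq. 4.26,
#   x = (A^T A + 2*lam*I)^{-1} A^T b,
# and gradient ascent on lam using dL/dlam = x^T x - 1.

A = np.eye(2)
b = np.array([3.0, 4.0])   # unconstrained solution [3, 4] has norm 5 > 1

lam = 1.0
lr = 0.1                   # hypothetical ascent rate on lam
I = np.eye(A.shape[1])
for _ in range(2000):
    x = np.linalg.solve(A.T @ A + 2.0 * lam * I, A.T @ b)
    lam = max(0.0, lam + lr * (x @ x - 1.0))   # keep lam >= 0

# At convergence x lies on the unit sphere: here x = [0.6, 0.8].
assert np.isclose(x @ x, 1.0, atol=1e-3)
```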
= (A suc A + λI ) theAresult b. ob (4.26) The magnitude of λ must bex chosen such h 2that obeys eys the constrain constraint. t. We can find this value by performing gradient ascent on λ. To do so, observ observee The magnitude of λ must be chosen such that the result obeys the constraint. We ∂ gradient ascent can find this value by performing on λ. To do so, observe L(x, λ) = x> x − 1. (4.27) ∂λ ∂ L(x , λ) ative = x isx positiv 1. e, so to follow the deriv (4.27) When the norm of x exceeds 1,∂ this derivative ositive, derivativ ativ ativee λ deriv − uphill and increase the Lagrangian with resp respect ect to λ, we increase λ . Because the When thet norm exceeds 1, this increased, derivative solving is positiv e, so to follow the deriv co coefficien efficien efficient on theofxx> x penalt enalty y has the linear equation for xativ wille uphill andaincrease Lagrangian withThe resppro ect cess to λof , we increase . Because the no now w yield solutionthe with smaller norm. process solving theλlinear equation x x x coefficien t on the penalt y has increased, solving theand linear for will λ con x has and adjusting contin tin tinues ues until the correct norm theequation deriv derivativ ativ ative e on λ is no w yield a solution with smaller norm. The pro cess of solving the linear equation 0. and adjusting λ continues until x has the correct norm and the derivative on λ is This concludes the mathematical preliminaries that we use to dev develop elop machine 0. learning algorithms. We are no now w ready to build and analyze some full-fledged This concludes the mathematical preliminaries that we use to develop machine learning systems. learning algorithms. We are now ready to build and analyze some full-fledged learning systems.
Chapter 5
Machine Learning Basics

Deep learning is a specific kind of machine learning. In order to understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This chapter provides a brief course in the most important general principles that will be applied throughout the rest of the book. Novice readers or those who want a wider perspective are encouraged to consider machine learning textbooks with a more comprehensive coverage of the fundamentals, such as Murphy (2012) or Bishop (2006). If you are already familiar with machine learning basics, feel free to skip ahead to Sec. 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.

We begin with a definition of what a learning algorithm is, and present an example: the linear regression algorithm. We then proceed to describe how the challenge of fitting the training data differs from the challenge of finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters that must be determined external to the learning algorithm itself; we discuss how to set these using additional data. Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions; we therefore present the two central approaches to statistics: frequentist estimators and Bayesian inference. Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning; we describe these categories and give some examples of simple learning algorithms from each category. Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent. We describe how to combine various algorithm components such as an
optimization algorithm, a cost function, a model, and a dataset to build a machine learning algorithm. Finally, in Sec. 5.11, we describe some of the factors that have limited the ability of traditional machine learning to generalize. These challenges have motivated the development of deep learning algorithms that overcome these obstacles.
5.1
Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides the definition "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." One can imagine a very wide variety of experiences E, tasks T, and performance measures P, and we do not make any attempt in this book to provide a formal definition of what may be used for each of these entities. Instead, the following sections provide intuitive descriptions and examples of the different kinds of tasks, performance measures and experiences that can be used to construct machine learning algorithms.
5.1.1
The Task, T

Machine learning allows us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings. From a scientific and philosophical point of view, machine learning is interesting because developing our understanding of machine learning entails developing our understanding of the principles that underlie intelligence.

In this relatively formal definition of the word “task,” the process of learning itself is not the task. Learning is our means of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.

Machine learning tasks are usually described in terms of how the machine learning system should process an example.
An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector x ∈ R^n where each entry x_i of the vector is another feature. For example, the features of an image are usually the values of the pixels in the image.
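The idea of an example as a feature vector can be sketched in a few lines of Python. This is an illustrative sketch, not code from the book, and the pixel values are made up:

```python
# A tiny illustrative sketch (values made up): a 2x2 grayscale "image"
# whose pixel brightness values become the entries x_i of a feature
# vector x in R^4, obtained by flattening the image row by row.
image = [
    [0.0, 0.5],
    [0.9, 1.0],
]

x = [pixel for row in image for pixel in row]  # the example as a vector

print(x)       # four features, one per pixel
print(len(x))  # n = 4
```

A real image would produce a much longer vector, but the principle is the same: one entry per measured feature.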
CHAPTER 5. MACHINE LEARNING BASICS
Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:

• Classification: In this type of task, the computer program is asked to specify
which of k categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function f : R^n → {1, . . . , k}. When y = f(x), the model assigns an input described by vector x to a category identified by numeric code y. There are other variants of the classification task, for example, where f outputs a probability distribution over classes. An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010). Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object
recognition is the same basic technology that allows computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and allow computers to interact more naturally with their users.

• Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. In order to solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions.
Each function corresponds to classifying x with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all of the relevant variables, then solve the classification task by marginalizing out the missing variables. With n input variables, we can now obtain all 2^n different classification functions needed for each possible set of missing inputs, but we only need to learn a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way. Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do.
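The marginalization scheme just described can be sketched with a toy discrete model. The joint probabilities below are arbitrary made-up numbers standing in for a learned model, but the mechanics are the real ones: a single joint distribution over the inputs and the label yields a classifier for every pattern of missing inputs.

```python
from itertools import product

# Toy sketch: one joint distribution p(x1, x2, y) over binary variables
# (numbers are arbitrary stand-ins for a learned model) replaces 2^n
# separate classifiers, one per pattern of missing inputs.
scores = {}
for x1, x2, y in product([0, 1], repeat=3):
    scores[(x1, x2, y)] = 1.0 + 2.0 * (y == x1) + 1.5 * (y == x2)
total = sum(scores.values())
joint = {k: v / total for k, v in scores.items()}  # normalize to sum to 1

def classify(x1=None, x2=None):
    """argmax_y p(y | observed inputs); None marks a missing input,
    which is marginalized out by summing over its possible values."""
    posterior = {0: 0.0, 1: 0.0}
    for (v1, v2, y), p in joint.items():
        if (x1 is None or v1 == x1) and (x2 is None or v2 == x2):
            posterior[y] += p
    return max(posterior, key=posterior.get)

print(classify(x1=1, x2=1))  # both inputs observed
print(classify(x1=0))        # x2 missing, marginalized out
```

The same `joint` table answers every query; only the pattern of `None` values changes, which is exactly the economy the text describes.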
• Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function f : R^n → R. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.

• Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format).
Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d). Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems used at major companies including Microsoft, IBM and Google (Hinton et al., 2012b).

• Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.
This is commonly applied to natural languages, such as to translate from English to French. Deep learning has recently begun to have an important impact on this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).

• Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category, and subsumes the transcription and translation tasks described above, but also many other tasks. One example is parsing—mapping a natural language sentence into a tree that describes its grammatical structure and tagging nodes of the trees as being verbs, nouns, or adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category. For example, deep learning can
be used to annotate the locations of roads in aerial photographs (Mnih and Hinton, 2010). The output need not have its form mirror the structure of the input as closely as in these annotation-style tasks. For example, in image captioning, the computer program observes an image and outputs a natural language sentence describing the image (Kiros et al., 2014a,b; Mao et al., 2015; Vinyals et al., 2015b; Donahue et al., 2014; Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015). These tasks are called structured output tasks because the program must output several values that are all tightly inter-related. For example, the words produced by an image captioning program must form a valid sentence.

• Anomaly detection: In this type of task, the computer program sifts through a set of events or objects, and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection.
By modeling your purchasing habits, a credit card company can detect misuse of your cards. If a thief steals your credit card or credit card information, the thief’s purchases will often come from a different probability distribution over purchase types than your own. The credit card company can prevent fraud by placing a hold on an account as soon as that card has been used for an uncharacteristic purchase. See Chandola et al. (2009) for a survey of anomaly detection methods.

• Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data. Synthesis and sampling via machine learning can be useful for media applications where it can be expensive or boring for an artist to generate large volumes of content by hand.
For example, video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel (Luo et al., 2013). In some cases, we want the sampling or synthesis procedure to generate some specific kind of output given the input. For example, in a speech synthesis task, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence. This is a kind of structured output task, but with the added qualification that there is no single correct output for each input, and we explicitly desire a large amount of variation in the output, in order for the output to seem more natural and realistic.

• Imputation of missing values: In this type of task, the machine learning algorithm is given a new example x ∈ R^n, but with some entries x_i of x
Theofalgorithm a prediction of thethe values of thelearning missing R x x algorithm is giv en a new example , but with some entries of x • en entries. tries. missing. The algorithm must provide a∈prediction of the values of the missing 102 entries.
• Denoising: In this type of task, the machine learning algorithm is given as input a corrupted example x̃ ∈ R^n obtained by an unknown corruption process from a clean example x ∈ R^n. The learner must predict the clean example x from its corrupted version x̃, or more generally predict the conditional probability distribution p(x | x̃).

• Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function p_model : R^n → R, where p_model(x) can be interpreted as a probability density function (if x is continuous) or a probability mass function (if x is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures P), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur.
Most of the tasks described above require that the learning algorithm has at least implicitly captured the structure of the probability distribution. Density estimation allows us to explicitly capture that distribution. In principle, we can then perform computations on that distribution in order to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution p(x), we can use that distribution to solve the missing value imputation task. If a value x_i is missing and all of the other values, denoted x_−i, are given, then we know the distribution over it is given by p(x_i | x_−i). In practice, density estimation does not always allow us to solve all of these related tasks, because in many cases the required operations on p(x) are computationally intractable.
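For a discrete distribution small enough to enumerate, computing p(x_i | x_−i) from the estimated joint really is just a normalization. A toy sketch with made-up probabilities:

```python
# Toy sketch (made-up numbers): from an estimated joint p(x1, x2) over
# two binary variables, recover p(x1 | x2) = p(x1, x2) / sum_v p(v, x2),
# i.e. the distribution over a missing x1 given an observed x2.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

def p_x1_given_x2(x2):
    norm = joint[(0, x2)] + joint[(1, x2)]  # marginal p(x2)
    return {v: joint[(v, x2)] / norm for v in (0, 1)}

print(p_x1_given_x2(1))  # distribution over the missing x1
```

The intractability the text warns about appears when x has many dimensions: the joint table and the normalizing sums grow exponentially, so implicit models cannot simply be enumerated like this.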
The typ ypes es of tasks we list here are in intended tended only to pro provide vide examples of what machin machinee learning can Of course, many other tasks and types of tasks are p ossible. The types of tasks do, not to define a rigid taxonomy of tasks. we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.
5.1.2
The Performance Measure, P

In order to evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure P is specific to the task T being carried out by the system.

For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the proportion of examples for which the model produces the correct output. We can also obtain
equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output. We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not. For tasks such as density estimation, it does not make sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we must use a different performance metric that gives the model a continuous-valued score for each example. The most common approach is to report the average log-probability the model assigns to some examples.

Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine
learning system.

The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.

In some cases, this is because it is difficult to decide what should be measured. For example, when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct? When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These kinds of design choices depend on the application.

In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical.
For example, this arises frequently in the context of density estimation. Many of the best probabilistic models represent probability distributions only implicitly. Computing the actual probability value assigned to a specific point in space in many such models is intractable. In these cases, one must design an alternative criterion that still corresponds to the design objectives, or design a good approximation to the desired criterion.
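The accuracy and error-rate measures described earlier in this section are simple to state in code. A generic sketch, with made-up test-set labels:

```python
# Accuracy = proportion of test-set examples the model gets right;
# error rate = average 0-1 loss = 1 - accuracy. Labels are made up.
def accuracy(predictions, targets):
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)

test_targets = [0, 1, 1, 0, 1]      # held-out labels
test_predictions = [0, 1, 0, 0, 1]  # hypothetical model outputs

acc = accuracy(test_predictions, test_targets)
print(acc)        # proportion correct
print(1.0 - acc)  # error rate (expected 0-1 loss)
```

The important part is which examples the measure is computed on: the lists above stand for a test set kept separate from training, as the text requires.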
5.1.3
The Experience, E

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.

Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset. A dataset is a collection of many examples, as
CHAPTER 5. MACHINE LEARNING BASICS
defined in Sec. 5.1.1. Sometimes we will also call examples data points.

One of the oldest datasets studied by statisticians and machine learning researchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of different parts of 150 iris plants. Each individual plant corresponds to one example. The features within each example are the measurements of each of the parts of the plant: the sepal length, sepal width, petal length and petal width. The dataset also records which species each plant belonged to. Three different species are represented in the dataset.

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly as in density estimation or implicitly for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.

Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target. For example, the Iris dataset is annotated with the species of each iris plant. A supervised learning algorithm can study the Iris dataset and learn to classify iris plants into three different species based on their measurements.

Roughly speaking, unsupervised learning involves observing several examples of a random vector x, and attempting to implicitly or explicitly learn the probability distribution p(x), or some interesting properties of that distribution, while supervised learning involves observing several examples of a random vector x and an associated value or vector y, and learning to predict y from x, usually by estimating p(y | x). The term supervised learning originates from the view of the target y being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.

Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector x ∈ Rⁿ, the joint distribution can be decomposed as

    p(x) = ∏_{i=1}^{n} p(x_i | x_1, …, x_{i−1}).    (5.1)

This decomposition means that we can solve the ostensibly unsupervised problem of modeling p(x) by splitting it into n supervised learning problems. Alternatively, we
can solve the supervised learning problem of learning p(y | x) by using traditional unsupervised learning technologies to learn the joint distribution p(x, y) and inferring

    p(y | x) = p(x, y) / Σ_{y′} p(x, y′).    (5.2)

Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help to roughly categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.

Other variants of the learning paradigm are possible. For example, in semi-supervised learning, some examples include a supervision target but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled.
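Eqs. 5.1 and 5.2 can be checked numerically on a small discrete distribution. The joint table below is random and purely illustrative; the point is only that the chain-rule factorization and the conditional computed from the joint behave as the equations say.

```python
# Numerical check of Eqs. 5.1 and 5.2 on a tiny made-up discrete distribution.
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))          # axes: x1, x2, y (arbitrary toy table)
joint /= joint.sum()                   # normalize into a valid distribution

# Eq. 5.1 (chain rule over the x components): p(x1, x2) = p(x1) p(x2 | x1)
p_x = joint.sum(axis=2)                # marginalize out y -> p(x1, x2)
p_x1 = p_x.sum(axis=1)                 # p(x1)
p_x2_given_x1 = p_x / p_x1[:, None]    # p(x2 | x1)
reconstructed = p_x1[:, None] * p_x2_given_x1
assert np.allclose(reconstructed, p_x)

# Eq. 5.2: p(y | x) = p(x, y) / sum over y' of p(x, y')
p_y_given_x = joint / joint.sum(axis=2, keepdims=True)
assert np.allclose(p_y_given_x.sum(axis=2), 1.0)  # each conditional sums to 1
```

Each conditional factor here is itself a supervised-style prediction target, which is exactly the sense in which Eq. 5.1 turns one unsupervised problem into n supervised ones.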
For a recent example of multi-instance learning with deep models, see Kotzias et al. (2015).

Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences. Such algorithms are beyond the scope of this book. Please see Sutton and Barto (1998) or Bertsekas and Tsitsiklis (1996) for information about reinforcement learning, and Mnih et al. (2013) for the deep learning approach to reinforcement learning.

Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples, which are in turn collections of features.

One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix X ∈ R^{150×4}, where X_{i,1} is the sepal length of plant i, X_{i,2} is the sepal width of plant i, etc. We will describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets.

Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all of the photographs may be described with the same length of vector. Sec. 9.7 and Chapter 10 describe how to handle different types
of such heterogeneous data. In cases like these, rather than describing the dataset as a matrix with m rows, we will describe it as a set containing m elements: {x^(1), x^(2), …, x^(m)}. This notation does not imply that any two example vectors x^(i) and x^(j) have the same size.

In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working with a dataset containing a design matrix of feature observations X, we also provide a vector of labels y, with y_i providing the label for example i.

Of course, sometimes the label may be more than just a single number. For example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.

Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.
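The design-matrix and label-vector conventions above can be sketched in a few lines of NumPy. The numbers here are random stand-ins, not the real Iris measurements; only the shapes and indexing conventions match the text.

```python
# Sketch of the design-matrix convention: X holds one example per row and one
# feature per column; y holds one integer species label per example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((150, 4))           # 150 iris plants, 4 measurements each (toy values)
y = rng.integers(0, 3, size=150)   # species label in {0, 1, 2} for each plant

# In the book's notation, X[i, 0] would play the role of X_{i,1},
# e.g. the sepal length of plant i.
sepal_length_of_plant_0 = X[0, 0]

print(X.shape, y.shape)            # (150, 4) (150,)
```

Every row of X is one example vector, which is exactly why the design matrix only works when all examples can be described by same-sized vectors.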
5.1.4
Example: Linear Regression
Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program's performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression. We will return to this example repeatedly as we introduce more machine learning concepts that help to understand its behavior.

As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector x ∈ Rⁿ as input and predict the value of a scalar y ∈ R as its output. In the case of linear regression, the output is a linear function of the input. Let ŷ be the value that our model predicts y should take on. We define the output to be

    ŷ = wᵀx    (5.3)

where w ∈ Rⁿ is a vector of parameters.

Parameters are values that control the behavior of the system. In this case, w_i is
the coefficient that we multiply by feature x_i before summing up the contributions from all the features. We can think of w as a set of weights that determine how each feature affects the prediction. If a feature x_i receives a positive weight w_i,
then increasing the value of that feature increases the value of our prediction ŷ. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature's weight is large in magnitude, then it has a large effect on the prediction. If a feature's weight is zero, it has no effect on the prediction.

We thus have a definition of our task T: to predict y from x by outputting ŷ = wᵀx. Next we need a definition of our performance measure, P.

Suppose that we have a design matrix of m example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of y for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as X^(test) and the vector of regression targets as y^(test).

One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If ŷ^(test) gives the predictions of the model on the test set, then the mean squared error is given by

    MSE_test = (1/m) Σ_i (ŷ^(test) − y^(test))_i².    (5.4)

Intuitively, one can see that this error measure decreases to 0 when ŷ^(test) = y^(test). We can also see that

    MSE_test = (1/m) ||ŷ^(test) − y^(test)||₂²,    (5.5)

so the error increases whenever the Euclidean distance between the predictions and the targets increases.

To make a machine learning algorithm, we need to design an algorithm that will improve the weights w in a way that reduces MSE_test when the algorithm is allowed to gain experience by observing a training set (X^(train), y^(train)). One intuitive way of doing this (which we will justify later, in Sec. 5.5.1) is just to minimize the mean squared error on the training set, MSE_train.
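The equivalence of the two forms of the mean squared error, Eqs. 5.4 and 5.5, is easy to verify numerically. The data below is synthetic; the check is just that the per-element average of squared errors equals the squared Euclidean norm divided by m.

```python
# Eqs. 5.4 and 5.5 in NumPy: the mean of per-example squared errors equals
# the squared Euclidean distance between predictions and targets, divided by m.
import numpy as np

rng = np.random.default_rng(0)
y_test = rng.random(20)                            # synthetic targets
y_hat = y_test + 0.1 * rng.standard_normal(20)     # noisy synthetic predictions

m = len(y_test)
mse_elementwise = np.mean((y_hat - y_test) ** 2)       # Eq. 5.4
mse_norm = np.linalg.norm(y_hat - y_test) ** 2 / m     # Eq. 5.5
assert np.isclose(mse_elementwise, mse_norm)
```

If the predictions were set exactly equal to the targets, both expressions would evaluate to 0, matching the intuition stated after Eq. 5.4.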
To minimize MSE_train, we can simply solve for where its gradient is 0:

    ∇_w MSE_train = 0    (5.6)

    ⇒ ∇_w (1/m) ||ŷ^(train) − y^(train)||₂² = 0    (5.7)

    ⇒ (1/m) ∇_w ||X^(train)w − y^(train)||₂² = 0    (5.8)
[Figure 5.1. Left panel, "Linear regression example": the data plotted over axes x₁ and y, with the fitted line ŷ = w₁x₁. Right panel, "Optimization of w": MSE(train) plotted as a function of w₁.]
    ⇒ ∇_w (X^(train)w − y^(train))ᵀ(X^(train)w − y^(train)) = 0    (5.9)

    ⇒ ∇_w (wᵀX^(train)ᵀX^(train)w − 2wᵀX^(train)ᵀy^(train) + y^(train)ᵀy^(train)) = 0    (5.10)

    ⇒ 2X^(train)ᵀX^(train)w − 2X^(train)ᵀy^(train) = 0    (5.11)

    ⇒ w = (X^(train)ᵀX^(train))⁻¹ X^(train)ᵀ y^(train)    (5.12)

The system of equations whose solution is given by Eq. 5.12 is known as the normal equations. Evaluating Eq. 5.12 constitutes a simple learning algorithm. For an example of the linear regression learning algorithm in action, see Fig. 5.1.

It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter, an intercept term b. In this model

    ŷ = wᵀx + b    (5.13)

so the mapping from parameters to predictions is still a linear function but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter b, one can continue to use the model with only weights but augment x with an
extra entry that is always set to 1. The weight corresponding to the extra 1 entry plays the role of the bias parameter. We will frequently use the term "linear" when referring to affine functions throughout this book.

The intercept term b is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being b in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm's expected estimate of a quantity is not equal to the true quantity.

Linear regression is of course an extremely simple and limited learning algorithm, but it provides an example of how a learning algorithm can work. In the subsequent sections we will describe some of the basic principles underlying learning algorithm design and demonstrate how these principles can be used to build more complicated learning algorithms.
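The whole learning algorithm of Eq. 5.12, including the augment-with-1s trick for the bias, fits in a few lines of NumPy. The data below is synthetic and noiseless so the recovered parameters can be checked exactly; in practice one would usually prefer `np.linalg.lstsq` over forming the normal equations, for numerical stability.

```python
# Linear regression via the normal equations (Eq. 5.12), with the bias b
# handled by appending a constant-1 feature to each input, as described above.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.standard_normal((m, n))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b                    # noiseless targets for a clean check

X_aug = np.hstack([X, np.ones((m, 1))])    # augment x with an extra 1 entry
# Eq. 5.12: w = (X^T X)^{-1} X^T y, computed with solve() rather than an
# explicit matrix inverse.
w_hat = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)

print(w_hat)   # first n entries recover w; the last entry recovers the bias b
```

The weight learned for the constant-1 column is exactly the bias parameter, illustrating why the augmented model and the explicit-intercept model of Eq. 5.13 are equivalent.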
5.2
Capacity, Overfitting and Underfitting
The central challenge in machine learning is that we must perform well on new, previously unseen inputs—not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

Typically, when training a machine learning model, we have access to a training set; we can compute some error measure on the training set, called the training error; and we reduce this training error. So far, what we have described is simply an optimization problem. What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well. The generalization error is defined as the expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice.

We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.

In our linear regression example, we trained the model by minimizing the training error,

    (1/m^(train)) ||X^(train)w − y^(train)||₂²,    (5.14)

but we actually care about the test error, (1/m^(test)) ||X^(test)w − y^(test)||₂².

How can we affect performance on the test set when we get to observe only the training set? The field of statistical learning theory provides some answers. If the
training and the test set are collected arbitrarily arbitrarily,, there is indeed little we can do. If we are allo allowed wed to make some assumptions about ho how w the training and test set training and the test set are collected arbitrarily , there is indeed little we can do. are collected, then we can mak makee some progress. If we are allowed to make some assumptions about how the training and test set train and are generated by a probability distribution ov over er datasets are The collected, thentest wedata can mak e some progress. called the data gener generating ating pr pro ocess ess.. We typically mak makee a set of assumptions kno known wn The train dataassumptions are generated by a probability distribution overexamples datasets collectiv collectively ely asand thetest i.i.d. These assumptions are that the called data gener atingendent processfrom . We each typically mak e athat set of knotest wn in eac each hthe dataset are indep independent other, and theassumptions train set and collectiv ely as al the i.i.d. assumptions Thesethe assumptions are thatdistribution the examples set are identic identical ally ly distribute distributed d, dra drawn wn from same probability as in eac h dataset are indep endent from each other, and that the train set and test eac each h other. This assumption allo allows ws us to describe the data generating process set are identical lyy distribution distributed, dra wna from same probability with a probabilit probability ov over er singlethe example. The same distribution distribution as is eac h other. This assumption allo ws us to describe the data generating process then used to generate ev every ery train example and every test example. We call that with a probabilit y distribution ovdata er a gener singleating example. The same distribution is p data. This shared underlying distribution the generating distribution distribution, , denoted then used to generate ev ery train example and every test example. 
W e call that probabilistic framework and the i.i.d. assumptions allo allow w us to mathematically sharedthe underlying distribution data gener . This study relationship betw etween eenthe training errorating and distribution test error. , denoted p probabilistic framework and the i.i.d. assumptions allow us to mathematically One immediate connection can observe betw between een error. the training and test error study the relationship betweenwe training error and test is that the expected training error of a randomly selected mo model del is equal to the One immediate connection we can observe betw een the training test error exp expected ected test error of that mo model. del. Suppose we ha hav ve a probabilityand distribution the expected error of a randomly moset deland is equal to the pis(xthat , y ) and we sampletraining from it rep repeatedly eatedly to generateselected the train the test set. exp ected test error of that mo del. Suppose w e ha v e a probability distribution For some fixed value w , then the exp expected ected training set error is exactly the same as p ( x , y ) and w e sample from it rep eatedly generate theare train set and thethe test set. the exp expected ected test set error, b ecause both to exp expectations ectations formed using same w F or some fixed v alue , then the exp ected training set error is exactly the same as dataset sampling process. The only difference betw etween een the tw two o conditions is the the exp testtoset b ecause both expectations are formed using the same name wected e assign theerror, dataset we sample. dataset sampling process. The only difference between the two conditions is the Ofwcourse, we e use awe machine name e assignwhen to thew dataset sample.learning algorithm, we do not fix the parameters ahead of time, then sample b oth datasets. 
We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected value of training error. The factors determining how well a machine learning algorithm will perform are its ability to:

1. Make the training error small.

2. Make the gap between training and test error small.

These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.

We can control whether a model is more likely to overfit or underfit by altering its capacity.
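The claims above about expected error can be checked with a small simulation: for parameters fixed before seeing any data, training and test error have the same expectation, while parameters fitted to the training set open a gap. The linear data generating distribution and all numeric values below are illustrative choices, not part of the book's formal argument.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(m, w_true=2.0, b_true=1.0, noise=0.5):
    """Draw m examples from one fixed data generating distribution p(x, y)."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = b_true + w_true * x + rng.normal(0.0, noise, size=m)
    return x, y

def mse(w, b, x, y):
    return np.mean((b + w * x - y) ** 2)

fixed_w, fixed_b = 1.5, 0.8        # parameters chosen before seeing any data
train_errs, test_errs, gaps = [], [], []
for _ in range(2000):
    x_tr, y_tr = sample_dataset(20)
    x_te, y_te = sample_dataset(20)
    # Fixed parameters: both errors are expectations under the same process.
    train_errs.append(mse(fixed_w, fixed_b, x_tr, y_tr))
    test_errs.append(mse(fixed_w, fixed_b, x_te, y_te))
    # Fitted parameters: chosen to reduce training error, so on average
    # test error is greater than or equal to training error.
    w_hat, b_hat = np.polyfit(x_tr, y_tr, 1)
    gaps.append(mse(w_hat, b_hat, x_te, y_te) - mse(w_hat, b_hat, x_tr, y_tr))

print(np.mean(train_errs), np.mean(test_errs))  # nearly identical
print(np.mean(gaps))                            # positive on average
```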
Informally, a model's capacity is its ability to fit a wide variety of
CHAPTER 5. MACHINE LEARNING BASICS
functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in its hypothesis space. Doing so increases the model's capacity.

A polynomial of degree one gives us the linear regression model with which we are already familiar, with prediction

    ŷ = b + wx.    (5.15)

By introducing x² as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of x:

    ŷ = b + w₁x + w₂x².    (5.16)

Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of x as additional features, for example to obtain a polynomial of degree 9:

    ŷ = b + Σᵢ₌₁⁹ wᵢxⁱ.    (5.17)
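As a sketch of how these models can be trained in closed form, the code below builds the expanded feature vector (1, x, x², ..., x^degree) and solves the normal equations directly. The synthetic quadratic dataset is a hypothetical example introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, size=30)  # quadratic truth

def polynomial_design(x, degree):
    """Feature matrix with columns [1, x, x^2, ..., x^degree]; the constant
    column plays the role of the bias b."""
    return np.column_stack([x**i for i in range(degree + 1)])

def fit_normal_equations(X, y):
    """Closed-form least squares: solve X^T X w = X^T y. The model can be
    nonlinear in x while remaining linear in the parameters w."""
    return np.linalg.solve(X.T @ X, X.T @ y)

train_mse = {}
for degree in (1, 2, 9):   # Eq. 5.15, Eq. 5.16 and Eq. 5.17 respectively
    X = polynomial_design(x, degree)
    w = fit_normal_equations(X, y)
    train_mse[degree] = np.mean((X @ w - y) ** 2)

print(train_mse)  # training error shrinks as the degree (capacity) grows
```

In practice, `np.linalg.lstsq` is numerically preferable to forming X⊤X at high degrees; the direct normal equations are shown only to mirror the text.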
Machine learning algorithms will generally perform best when their capacity is appropriate in regard to the true complexity of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task they may overfit.

Fig. 5.2 shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem where the true underlying function is quadratic. The linear function is unable to capture the curvature in the true underlying problem, so it underfits. The degree-9 predictor is capable of representing the correct function, but it is also capable of representing infinitely many other functions that pass exactly through the training points, because we have more
parameters than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions exist. In this example, the quadratic model is perfectly matched to the true structure of the task so it generalizes well to new data.
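The comparison in Fig. 5.2 can be reproduced numerically. The quadratic ground-truth function, noise level and sample sizes below are hypothetical choices; with 10 training points, the degree-9 polynomial has as many parameters as examples, so it passes exactly through the training set.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_fn(x):
    return 0.5 - x + 2.0 * x**2        # quadratic underlying function

x_train = rng.uniform(-1.0, 1.0, size=10)
y_train = true_fn(x_train) + rng.normal(0.0, 0.1, size=10)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = true_fn(x_test) + rng.normal(0.0, 0.1, size=200)

errors = {}
for degree in (1, 2, 9):
    coefs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    errors[degree] = (
        np.mean((np.polyval(coefs, x_train) - y_train) ** 2),  # train MSE
        np.mean((np.polyval(coefs, x_test) - y_test) ** 2),    # test MSE
    )

# Degree 1 underfits (high train and test error); degree 9 overfits (near-zero
# train error, large test error); degree 2 matches the task and generalizes.
print(errors)
```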
[Figure 5.2: a linear function, a quadratic function, and a degree-9 polynomial fit to data whose true underlying function is quadratic; axes x (horizontal) and y (vertical).]
So far we have only described changing a model's capacity by changing the number of input features it has (and simultaneously adding new parameters associated with those features). There are in fact many ways of changing a model's capacity. Capacity is not determined only by the choice of model. The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model. In many cases, finding the best function within this family is a very difficult optimization problem. In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error. These additional limitations, such as the imperfection
of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.

Our modern ideas about improving the generalization of machine learning models are refinements of thought dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam's razor (c. 1287-1347). This principle states that among competing hypotheses that explain known observations equally well, one should choose the "simplest" one. This idea was formalized and made more precise in the 20th century by the founders of statistical learning theory (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).

Statistical learning theory provides various means of quantifying model capacity.
Among these, the most well-known is the Vapnik-Chervonenkis dimension, or VC dimension. The VC dimension measures the capacity of a binary classifier. The VC dimension is defined as being the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.

Quantifying the capacity of the model allows statistical learning theory to make quantitative predictions. The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et
al., 1989; Vapnik, 1995). These bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when working with deep learning algorithms. This is in part because the bounds are often quite loose and in part because it can be quite difficult to determine the capacity of deep learning algorithms. The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the very general non-convex optimization problems involved in deep learning.

We must remember that while simpler functions are
more likely to generalize (to have a small gap between training and test error) we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value). Typically, generalization error has a U-shaped curve as a function of model capacity. This is illustrated in Fig. 5.3.

To reach the most extreme case of arbitrarily high capacity, we introduce the concept of non-parametric models. So far, we have seen only parametric
models, such as linear regression. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Non-parametric models have no such limitation.

Sometimes, non-parametric models are just theoretical abstractions (such as an algorithm that searches over all possible probability distributions) that cannot be implemented in practice. However, we can also design practical non-parametric models by making their complexity a function of the training set size. One example of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the X and y from the training set. When asked to classify a test point x, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, ŷ = yᵢ where i = arg minᵢ ||Xᵢ,: − x||₂².
The algorithm can also be generalized to distance metrics other than the L² norm, such as learned distance metrics (Goldberger et al., 2005). If the algorithm is allowed to break ties by averaging the yᵢ values for all Xᵢ,: that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which might be greater than zero, if two identical inputs are associated with different outputs) on any regression dataset.

Finally, we can also create a non-parametric learning algorithm by wrapping a parametric learning algorithm inside another algorithm that increases the number
of parameters as needed. For example, we could imagine an outer loop of learning that changes the degree of the polynomial learned by linear regression on top of a polynomial expansion of the input.

The ideal model is an oracle that simply knows the true probability distribution that generates the data. Even such a model will still incur some error on many problems, because there may still be some noise in the distribution. In the case of supervised learning, the mapping from x to y may be inherently stochastic, or y may be a deterministic function that involves other variables besides those included in x. The error incurred by an oracle making predictions from the true distribution p(x, y) is called the Bayes error.

Training and generalization error vary as the size of the training set varies. Expected generalization error can
never increase as the number of training examples increases. For non-parametric models, more data yields better generalization until the best possible error is achieved. Any fixed parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error. See Fig. 5.4 for an illustration. Note that it is possible for the model to have optimal capacity and yet still have a large gap between training and generalization error. In this situation, we may be able to reduce this gap by gathering more training examples.
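To make the nearest neighbor regression procedure described above concrete, here is a minimal sketch, including the tie-breaking rule of averaging the targets of all equally near training points. The toy dataset is hypothetical.

```python
import numpy as np

def nearest_neighbor_regress(X_train, y_train, x):
    """Predict for a single test point x: find the nearest stored training
    input under the squared L2 norm and average the targets of all ties."""
    dists = np.sum((X_train - x) ** 2, axis=1)
    tied_nearest = dists == dists.min()
    return y_train[tied_nearest].mean()

# No fixed-length parameter vector: the model just stores X and y, so its
# capacity grows with the training set size.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])

# On training inputs without duplicates, the nearest neighbor of each point
# is the point itself, so training error is exactly zero.
train_preds = np.array([nearest_neighbor_regress(X_train, y_train, x)
                        for x in X_train])
print(np.mean((train_preds - y_train) ** 2))   # 0.0

# A query exactly halfway between two stored points averages their targets.
print(nearest_neighbor_regress(X_train, y_train, np.array([1.5])))  # 2.5
```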
5.2.1 The No Free Lunch Theorem
Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set.

In part, machine learning avoids this problem by offering only probabilistic rules, rather than the entirely certain rules used in purely logical reasoning. Machine learning promises to find rules that are probably correct about most members of the set they concern.

Unfortunately, even this does not resolve the entire problem. The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.

This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the "real world" that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data generating distributions we care about.
5.2.2 Regularization
The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better.

So far, the only method of modifying a learning algorithm we have discussed is to increase or decrease the model's capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose. We gave the specific example of increasing or decreasing the degree of a polynomial for a regression problem. The view we have described so far is oversimplified.

The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions.
The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear functions of its input. These linear functions can be very useful for problems where the relationship between inputs and outputs truly is close to linear. They are less useful for problems that behave in a very nonlinear fashion. For example, linear regression would not perform very well if we tried to use it to predict sin(x) from x. We can thus control the performance of our algorithms by choosing what kind of functions we allow them to draw solutions from, as well as by controlling the amount of these functions.

We can also give a learning algorithm a preference for one solution in its hypothesis space to another. This means that both functions are eligible, but one is preferred.
The unpreferred solution will be chosen only if it fits the training data significantly better than the preferred solution.

For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize
a sum comprising both the mean squared error on the training set and a criterion J(w) that expresses a preference for the weights to have smaller squared L² norm. Specifically,

    J(w) = MSE_train + λw⊤w,    (5.18)

where λ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When λ = 0, we impose no preference, and larger λ forces the weights to become smaller. Minimizing J(w) results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or put weight on fewer of the features. As an example of how we can control a model's tendency to overfit or underfit via weight decay, we can train a high-degree polynomial regression model with different values of λ. See Fig. 5.5 for the results.
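Since J(w) in Eq. 5.18 is quadratic in w, it has a closed-form minimizer: with MSE_train = (1/m)||Xw − y||², setting the gradient to zero gives (X⊤X + mλI)w = X⊤y. The sketch below applies this to degree-9 polynomial features. The sine-based dataset is a hypothetical example, and for simplicity the constant (bias) feature is regularized along with the rest, whereas the text's formulation keeps the bias b separate.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=20)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=20)
X = np.column_stack([x**i for i in range(10)])   # degree-9 polynomial features

def fit_weight_decay(X, y, lam):
    """Minimize J(w) = MSE_train + lam * w^T w in closed form.
    With MSE_train = (1/m)||Xw - y||^2, the gradient vanishes when
    (X^T X + m * lam * I) w = X^T y."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

results = {}
for lam in (0.0, 1e-3, 1.0):
    w = fit_weight_decay(X, y, lam)
    results[lam] = (np.mean((X @ w - y) ** 2), float(w @ w))

# Larger lam trades higher training error for a smaller squared weight norm.
for lam, (err, norm) in results.items():
    print(lam, err, norm)
```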
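The effect of the weight decay criterion in Eq. 5.18 can be sketched numerically. The following is a minimal illustration, assuming a small synthetic dataset (the function name, data, and sizes are illustrative, not from the book): it minimizes (1/m)‖Xw − y‖² + λwᵀw in closed form, and larger λ shrinks the learned weights.

```python
import numpy as np

def weight_decay_fit(X, y, lam):
    # Closed-form minimizer of (1/m)||Xw - y||^2 + lam * w^T w:
    # setting the gradient to zero gives (X^T X + lam * m * I) w = X^T y.
    m, n = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

w_no_decay = weight_decay_fit(X, y, lam=0.0)
w_decay = weight_decay_fit(X, y, lam=10.0)
# Larger lambda trades training fit for a smaller weight norm.
assert np.linalg.norm(w_decay) < np.linalg.norm(w_no_decay)
```

With lam=0 this reduces to ordinary least squares; increasing lam pulls the solution toward zero, the tradeoff described above.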
More generally, we can regularize a model that learns a function f(x; θ) by adding a penalty called a regularizer to the cost function. In the case of weight decay, the regularizer is Ω(w) = wᵀw. In Chapter 7, we will see that many other
regularizers are possible.

Expressing preferences for one function over another is a more general way of controlling a model's capacity than including or excluding members from the hypothesis space. We can think of excluding a function from a hypothesis space as expressing an infinitely strong preference against that function.

In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as regularization. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization.

The no free lunch theorem has made it clear that there is no best machine learning algorithm, and, in particular, no best form of regularization.
Instead we must choose a form of regularization that is well-suited to the particular task we want to solve. The philosophy of deep learning in general and this book in particular is that a very wide range of tasks (such as all of the intellectual tasks that people can do) may all be solved effectively using very general-purpose forms of regularization.
5.3
Hyperparameters and Validation Sets
Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm. These settings are called hyperparameters. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure where one learning algorithm learns the best hyperparameters for another learning algorithm).

In the polynomial regression example we saw in Fig. 5.2, there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The λ value used to control the strength of weight decay is another example of a hyperparameter.

Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is difficult to optimize. More frequently, we do not learn the hyperparameter because it is not appropriate to learn that hyperparameter on the training set.
This applies to all hyperparameters that control model capacity. If learned on the training set, such hyperparameters would always
choose the maximum possible model capacity, resulting in overfitting (refer to Fig. 5.3). For example, we can always fit the training set better with a higher degree polynomial and a weight decay setting of λ = 0 than we could with a lower degree polynomial and a positive weight decay setting.

To solve this problem, we need a validation set of examples that the training algorithm does not observe.

Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets.
One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process. The subset of data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80% of the training data for training and 20% for validation. Since the validation set is used to "train" the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error.
After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.

In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years, and especially if we consider all the attempts from the scientific community at beating the reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained system. Thankfully, the community tends to move on to new (and usually more ambitious and larger) benchmark datasets.
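The 80%/20% train/validation split described above can be sketched as follows; the function name and the toy arrays are illustrative, not from the book:

```python
import numpy as np

def train_valid_split(X, y, valid_fraction=0.2, seed=0):
    # Shuffle the example indices, then reserve a fraction for validation.
    m = len(X)
    idx = np.random.default_rng(seed).permutation(m)
    n_valid = int(m * valid_fraction)
    valid_idx, train_idx = idx[:n_valid], idx[n_valid:]
    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]

X = np.arange(100.0).reshape(50, 2)
y = np.arange(50.0)
X_tr, y_tr, X_va, y_va = train_valid_split(X, y)
assert len(X_tr) == 40 and len(X_va) == 10
```

The validation pair (X_va, y_va) is what the hyperparameter search would score against; the test set stays untouched until the very end.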
5.3.1
Cross-Validation
Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small. A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task.
When the dataset has hundreds of thousands of examples or more, this is not a serious issue. When the dataset is too small, there are alternative procedures, which allow one to use all of the examples in the estimation of the mean test error, at the price of increased computational cost. These procedures are based on the idea of repeating the training and testing computation on different randomly chosen subsets or splits of the original dataset. The most common of these is the k-fold cross-validation procedure, shown in Algorithm 5.1, in which a partition of the dataset is formed by splitting it into k non-overlapping subsets. The test error may then be estimated by taking the average test error across k trials. On trial i, the i-th subset of the data is used as the test set and the rest of the data is used as the training set.
One problem is that there exist no unbiased estimators of the variance of such average error estimators (Bengio and Grandvalet, 2004), but approximations are typically used.
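A minimal sketch of k-fold cross-validation in the spirit of Algorithm 5.1, assuming a dataset given as arrays; `fit` and `loss` are hypothetical stand-ins for the learning algorithm A and the per-example loss L:

```python
import numpy as np

def k_fold_errors(X, y, fit, loss, k=5, seed=0):
    # Partition the shuffled indices into k non-overlapping subsets D_i,
    # train on the complement of each, and collect per-example test errors.
    idx = np.random.default_rng(seed).permutation(len(X))
    errors = []
    for fold in np.array_split(idx, k):        # the k test subsets D_i
        train = np.setdiff1d(idx, fold)        # D \ D_i
        model = fit(X[train], y[train])        # f_i = A(D \ D_i)
        errors.extend(loss(model, x, t) for x, t in zip(X[fold], y[fold]))
    return np.array(errors)  # mean estimates the generalization error

# Toy usage: "learn" the mean of y and score with squared error.
X = np.arange(20.0).reshape(10, 2)
y = np.arange(10.0)
errs = k_fold_errors(X, y, fit=lambda X, y: y.mean(),
                     loss=lambda f, x, t: (f - t) ** 2)
assert len(errs) == len(X)  # one error per example in D
```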
5.4
Estimators, Bias and Variance
The field of statistics gives us many tools that can be used to achieve the machine learning goal of solving a task not only on the training set but also to generalize. Foundational concepts such as parameter estimation, bias and variance are useful to formally characterize notions of generalization, underfitting and overfitting.
5.4.1
Point Estimation
Point estimation is the attempt to provide the single "best" prediction of some quantity of interest. In general the quantity of interest can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example in Sec. 5.1.4, but it can also be a whole function.

In order to distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter θ by θ̂.

Let {x^(1), …, x^(m)} be a set of m independent and identically distributed (i.i.d.) data points. A point estimator or statistic is any function of the data:

    θ̂_m = g(x^(1), …, x^(m)).    (5.19)

The definition does not require that g return a value that is close to the true θ or even that the range of g is the same as the set of allowable values of θ. This definition of a point estimator is very general and allows the designer of an estimator great flexibility.
While almost any function thus qualifies as an estimator,
Algorithm 5.1: The k-fold cross-validation algorithm. It can be used to estimate generalization error of a learning algorithm A when the given dataset D is too small for a simple train/test or train/valid split to yield accurate estimation of generalization error, because the mean of a loss L on a small test set may have too high variance. The dataset D contains as elements the abstract examples z^(i) (for the i-th example), which could stand for an (input, target) pair z^(i) = (x^(i), y^(i)) in the case of supervised learning, or for just an input z^(i) = x^(i) in the case of unsupervised learning. The algorithm returns the vector of errors e for each example in D, whose mean is the estimated generalization error. The errors on individual examples can be used to compute a confidence interval around the mean (Eq. 5.47).
While these confidence intervals are not well-justified after use of cross-validation, it is still common practice to use them to declare that algorithm A is better than algorithm B only if the confidence interval of the error of algorithm A lies below and does not intersect the confidence interval of algorithm B.

Define KFoldXV(D, A, L, k):
Require: D, the given dataset, with elements z^(i)
Require: A, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function
Require: L, the loss function, seen as a function from a learned function f and an example z^(i) ∈ D to a scalar ∈ ℝ
Require: k, the number of folds
Split D into k mutually exclusive subsets D_i, whose union is D.
for i from 1 to k do
    f_i = A(D \ D_i)
    for z^(j) in D_i do
        e_j = L(f_i, z^(j))
    end for
end for
Return e
a good estimator is a function whose output is close to the true underlying θ that generated the training data.

For now, we take the frequentist perspective on statistics. That is, we assume that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a function of the data. Since the data is drawn from a random process, any function of the data is random. Therefore θ̂ is a random variable.

Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as function estimators.

As we mentioned above, sometimes we are interested in performing function estimation (or function approximation). Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x) that describes the approximate relationship between y and x. For example, we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x.
In function estimation, we are interested in approximating f with a model or estimate f̂. Function estimation is really just the same as estimating a parameter θ; the function estimator f̂ is simply a point estimator in function space. The linear regression example (discussed above in Sec. 5.1.4) and the polynomial regression example (discussed in Sec. 5.2) are both examples of scenarios that may be interpreted either as estimating a parameter w or estimating a function f̂ mapping from x to y.

We now review the most commonly studied properties of point estimators and discuss what they tell us about these estimators.
5.4.2
Bias
The bias of an estimator is defined as:

    bias(θ̂_m) = E(θ̂_m) − θ,    (5.20)

where the expectation is over the data (seen as samples from a random variable) and θ is the true underlying value of θ used to define the data generating distribution. An estimator θ̂_m is said to be unbiased if bias(θ̂_m) = 0, which implies that E(θ̂_m) = θ. An estimator θ̂_m is said to be asymptotically unbiased if lim_{m→∞} bias(θ̂_m) = 0, which implies that lim_{m→∞} E(θ̂_m) = θ.

Consider a set of samples {x^(1), …, x^(m)} that are independently and identically distributed according to a Bernoulli distribution with mean θ:

    P(x^(i); θ) = θ^{x^(i)} (1 − θ)^{(1 − x^(i))}.    (5.21)

A common estimator for the θ parameter of this distribution is the mean of the training samples:

    θ̂_m = (1/m) Σ_{i=1}^m x^(i).    (5.22)

To determine whether this estimator is biased, we can substitute Eq. 5.22 into Eq. 5.20:

    bias(θ̂_m) = E[θ̂_m] − θ    (5.23)
              = E[ (1/m) Σ_{i=1}^m x^(i) ] − θ    (5.24)
              = (1/m) Σ_{i=1}^m E[x^(i)] − θ    (5.25)
              = (1/m) Σ_{i=1}^m Σ_{x^(i)=0}^{1} ( x^(i) θ^{x^(i)} (1 − θ)^{(1 − x^(i))} ) − θ    (5.26)
              = (1/m) Σ_{i=1}^m (θ) − θ    (5.27)
              = θ − θ = 0    (5.28)

Since bias(θ̂) = 0, we say that our estimator θ̂ is unbiased.

Now, consider a set of samples {x^(1), …, x^(m)} that are independently and identically distributed according to a Gaussian distribution p(x^(i)) = N(x^(i); µ, σ²), where i ∈ {1, …, m}. Recall that the Gaussian probability density function is given by

    p(x^(i); µ, σ²) = (1/√(2πσ²)) exp( −(x^(i) − µ)² / (2σ²) ).    (5.29)

A common estimator of the Gaussian mean parameter is known as the sample mean:

    µ̂_m = (1/m) Σ_{i=1}^m x^(i)    (5.30)
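The conclusion of Eq. 5.28 can be checked empirically: averaging the Bernoulli sample-mean estimator over many simulated datasets recovers θ. A small sketch, with all values illustrative:

```python
import numpy as np

# Draw many datasets of m Bernoulli(theta) samples, compute the sample mean
# of each, and check that the estimates average out to the true theta.
rng = np.random.default_rng(0)
theta, m, trials = 0.3, 10, 100_000

estimates = rng.binomial(1, theta, size=(trials, m)).mean(axis=1)
# The empirical bias (mean estimate minus theta) is near zero, as Eq. 5.28
# predicts for an unbiased estimator.
assert abs(estimates.mean() - theta) < 0.01
```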
To determine the bias of the sample mean, we are again interested in calculating its expectation:

    bias(µ̂_m) = E[µ̂_m] − µ    (5.31)
              = E[ (1/m) Σ_{i=1}^m x^(i) ] − µ    (5.32)
              = (1/m) Σ_{i=1}^m E[x^(i)] − µ    (5.33)
              = (1/m) Σ_{i=1}^m µ − µ    (5.34)
              = µ − µ = 0    (5.35)

Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.

As an example, we compare two different estimators of the variance parameter σ² of a Gaussian distribution. We are interested in knowing if either estimator is biased.

The first estimator of σ² we consider is known as the sample variance:

    σ̂²_m = (1/m) Σ_{i=1}^m (x^(i) − µ̂_m)²,    (5.36)

where µ̂_m is the sample mean, defined above. More formally, we are interested in computing

    bias(σ̂²_m) = E[σ̂²_m] − σ².    (5.37)

We begin by evaluating the term E[σ̂²_m]:

    E[σ̂²_m] = E[ (1/m) Σ_{i=1}^m (x^(i) − µ̂_m)² ]    (5.38)
             = ((m − 1)/m) σ².    (5.39)

Returning to Eq. 5.37, we conclude that the bias of σ̂²_m is −σ²/m. Therefore, the sample variance is a biased estimator.
The unbiased sample variance estimator

    σ̃²_m = (1/(m − 1)) Σ_{i=1}^m (x^(i) − µ̂_m)²    (5.40)

provides an alternative approach. As the name suggests, this estimator is unbiased. That is, we find that E[σ̃²_m] = σ²:

    E[σ̃²_m] = E[ (1/(m − 1)) Σ_{i=1}^m (x^(i) − µ̂_m)² ]    (5.41)
             = (m/(m − 1)) E[σ̂²_m]    (5.42)
             = (m/(m − 1)) ((m − 1)/m) σ²    (5.43)
             = σ².    (5.44)

We have two estimators: one is biased and the other is not. While unbiased estimators are clearly desirable, they are not always the "best" estimators. As we will see, we often use biased estimators that possess other important properties.
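The bias computed above can be verified numerically by averaging both estimators over many resampled Gaussian datasets. This sketch (sizes and seed are illustrative) compares the divide-by-m estimator of Eq. 5.36 with the divide-by-(m − 1) estimator of Eq. 5.40:

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials, sigma2 = 5, 200_000, 4.0

# Each row is one dataset of m Gaussian samples with variance sigma2.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, m))
mu_hat = samples.mean(axis=1, keepdims=True)
sq_dev = np.sum((samples - mu_hat) ** 2, axis=1)

biased = np.mean(sq_dev / m)          # Eq. 5.36, expectation (m-1)/m * sigma2
unbiased = np.mean(sq_dev / (m - 1))  # Eq. 5.40, expectation sigma2

assert abs(biased - (m - 1) / m * sigma2) < 0.05
assert abs(unbiased - sigma2) < 0.05
```

The biased estimator averages close to (m − 1)/m · σ² = 3.2 here, while the unbiased one averages close to σ² = 4.0, matching Eqs. 5.39 and 5.44.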
5.4.3
Variance and Standard Error
Another property of the estimator that we might want to consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply the variance

    Var(θ̂)    (5.45)

where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted SE(θ̂).

The variance or the standard error of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data generating process. Just as we might like an estimator to exhibit low bias we would also like it to have relatively low variance.

When we compute any statistic using a finite number of samples, our estimate
of the true underlying parameter is uncertain, in the sense that we could have obtained other samples from the same distribution and their statistics would have
CHAPTER 5. MACHINE LEARNING BASICS
been different. The expected degree of variation in any estimator is a source of error that we want to quantify.

The standard error of the mean is given by

$$\mathrm{SE}(\hat\mu_m) = \sqrt{\mathrm{Var}\left[\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right]} = \frac{\sigma}{\sqrt{m}}, \tag{5.46}$$

where $\sigma^2$ is the true variance of the samples $x^{(i)}$. The standard error is often estimated by using an estimate of $\sigma$. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provides an unbiased estimate of the standard deviation. Both approaches tend to underestimate the true standard deviation, but are still used in practice. The square root of the unbiased estimator of the variance is less of an underestimate. For large $m$, the approximation is quite reasonable.

The standard error of the mean is very useful in machine learning experiments. We often estimate the generalization error by computing the sample mean of the error on the test set.
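The $\sigma/\sqrt{m}$ scaling of Eq. 5.46 is easy to verify empirically. The following sketch (sample sizes and seed are arbitrary choices) resamples many datasets and measures the spread of their sample means:

```python
# Sketch: the spread of the sample mean over many resampled datasets
# matches sigma / sqrt(m) (Eq. 5.46).
import numpy as np

rng = np.random.default_rng(1)
m, trials = 100, 100_000
sigma = 2.0  # true standard deviation of each sample

# Each row is one resampled dataset of m points; take its mean.
means = rng.normal(0.0, sigma, size=(trials, m)).mean(axis=1)

print(means.std())  # close to sigma / sqrt(m) = 0.2
```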
The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the central limit theorem, which tells us that the mean will be approximately distributed with a normal distribution, we can use the standard error to compute the probability that the true expectation falls in any chosen interval. For example, the 95% confidence interval centered on the mean $\hat\mu_m$ is

$$\left(\hat\mu_m - 1.96\,\mathrm{SE}(\hat\mu_m),\; \hat\mu_m + 1.96\,\mathrm{SE}(\hat\mu_m)\right), \tag{5.47}$$

under the normal distribution with mean $\hat\mu_m$ and variance $\mathrm{SE}(\hat\mu_m)^2$. In machine learning experiments, it is common to say that algorithm $A$ is better than algorithm $B$ if the upper bound of the 95% confidence interval for the error of algorithm $A$ is less than the lower bound of the 95% confidence interval for the error of algorithm $B$.

We once again consider a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently and identically from a Bernoulli distribution (recall $P(x^{(i)}; \theta) = \theta^{x^{(i)}}(1-\theta)^{(1-x^{(i)})}$). This time we are interested in computing the variance of the estimator $\hat\theta_m = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$.

$$\mathrm{Var}\left(\hat\theta_m\right) = \mathrm{Var}\left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right) \tag{5.48}$$

$$= \frac{1}{m^2}\sum_{i=1}^{m}\mathrm{Var}\left(x^{(i)}\right) \tag{5.49}$$

$$= \frac{1}{m^2}\sum_{i=1}^{m}\theta(1-\theta) \tag{5.50}$$

$$= \frac{1}{m^2}\, m\,\theta(1-\theta) \tag{5.51}$$

$$= \frac{1}{m}\,\theta(1-\theta) \tag{5.52}$$

The variance of the estimator decreases as a function of $m$, the number of examples in the dataset. This is a common property of popular estimators that we will return to when we discuss consistency (see Sec. 5.4.5).
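As a check on Eq. 5.52, a short simulation (the parameter values here are arbitrary illustrative choices) estimates the variance of $\hat\theta_m$ directly:

```python
# Sketch verifying Eq. 5.52: the variance of the Bernoulli mean estimator
# is theta * (1 - theta) / m.
import numpy as np

rng = np.random.default_rng(2)
theta, m, trials = 0.3, 50, 200_000

# Each row is one dataset of m Bernoulli draws; theta_hat is its mean.
theta_hat = rng.binomial(1, theta, size=(trials, m)).mean(axis=1)

print(theta_hat.var())  # close to theta * (1 - theta) / m = 0.0042
```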
5.4.4
Trading off Bias and Variance to Minimize Mean Squared Error

Bias and variance measure two different sources of error in an estimator. Bias
measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.

What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? For example, imagine that we are interested in approximating the function shown in Fig. 5.2 and we are only offered the choice between a model with large bias and one that suffers from large variance. How do we choose between them?

The most common way to negotiate this trade-off is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks. Alternatively, we can also compare the mean squared error (MSE) of the estimates:

$$\mathrm{MSE} = \mathbb{E}\left[(\hat\theta_m - \theta)^2\right] \tag{5.53}$$

$$= \mathrm{Bias}(\hat\theta_m)^2 + \mathrm{Var}(\hat\theta_m) \tag{5.54}$$

The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter $\theta$. As is clear from Eq. 5.54, evaluating the MSE incorporates both the bias and the variance. Desirable estimators are those with small MSE, and these are estimators that manage to keep both their bias and variance somewhat in check.

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting. In the case where generalization error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in Fig. 5.6, where we see again the U-shaped curve of generalization error as a function of capacity.
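The decomposition in Eq. 5.54 can be checked numerically; in fact, it holds exactly for the empirical moments of any collection of estimates. A sketch (arbitrary seed and sizes) using the biased variance estimator as $\hat\theta$:

```python
# Numerical sketch of Eq. 5.54: MSE decomposes into squared bias plus variance.
import numpy as np

rng = np.random.default_rng(3)
m, trials = 5, 200_000
sigma2 = 1.0  # true parameter being estimated

# One biased variance estimate (divide by m) per simulated dataset.
est = rng.normal(0.0, 1.0, size=(trials, m)).var(axis=1, ddof=0)

mse = ((est - sigma2) ** 2).mean()
bias2 = (est.mean() - sigma2) ** 2
var = est.var()
print(mse, bias2 + var)  # the two quantities agree up to floating point
```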
5.4.5
Consistency
So far we have discussed the properties of various estimators for a training set of fixed size. Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. In particular, we usually wish that, as the number of data points $m$ in our dataset increases, our point estimates converge to the true value of the corresponding parameters. More formally, we would like that

$$\hat\theta_m \xrightarrow{\,p\,} \theta \quad \text{as } m \to \infty. \tag{5.55}$$

The symbol $\xrightarrow{\,p\,}$ means that the convergence is in probability, i.e. for any $\epsilon > 0$, $P(|\hat\theta_m - \theta| > \epsilon) \to 0$ as $m \to \infty$. The condition described by Eq. 5.55 is known as consistency. It is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of $\hat\theta$ to $\theta$. Almost sure
convergence of a sequence of random variables $x^{(1)}, x^{(2)}, \ldots$ to a value $x$ occurs when $p(\lim_{m\to\infty} x^{(m)} = x) = 1$.

Consistency ensures that the bias induced by the estimator is assured to diminish as the number of data examples grows. However, the reverse is not true: asymptotic unbiasedness does not imply consistency. For example, consider estimating the mean parameter $\mu$ of a normal distribution $\mathcal{N}(x; \mu, \sigma^2)$, with a dataset consisting of $m$ samples: $\{x^{(1)}, \ldots, x^{(m)}\}$. We could use the first sample of the dataset $x^{(1)}$ as an unbiased estimator: $\hat\theta = x^{(1)}$. In that case, $\mathbb{E}(\hat\theta_m) = \theta$, so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator, as it is not the case that $\hat\theta_m \to \theta$ as $m \to \infty$.

5.5 Maximum Likelihood Estimation

Previously, we have seen some definitions of common estimators and analyzed their properties.
But where did these estimators come from? Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.

The most common such principle is the maximum likelihood principle.

Consider a set of $m$ examples $\mathbb{X} = \{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently from the true but unknown data generating distribution $p_{\mathrm{data}}(x)$.

Let $p_{\mathrm{model}}(x; \theta)$ be a parametric family of probability distributions over the same space indexed by $\theta$. In other words, $p_{\mathrm{model}}(x; \theta)$ maps any configuration $x$ to a real number estimating the true probability $p_{\mathrm{data}}(x)$.

The maximum likelihood estimator for $\theta$ is then defined as

$$\theta_{\mathrm{ML}} = \arg\max_\theta p_{\mathrm{model}}(\mathbb{X}; \theta) \tag{5.56}$$

$$= \arg\max_\theta \prod_{i=1}^{m} p_{\mathrm{model}}\left(x^{(i)}; \theta\right). \tag{5.57}$$

This product over many probabilities can be inconvenient for a variety of reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product
into a sum:

$$\theta_{\mathrm{ML}} = \arg\max_\theta \sum_{i=1}^{m} \log p_{\mathrm{model}}\left(x^{(i)}; \theta\right). \tag{5.58}$$
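The underflow problem that motivates the switch from the product in Eq. 5.57 to the sum of logs in Eq. 5.58 is easy to demonstrate directly:

```python
# Sketch: the raw product of many probabilities underflows to 0.0 in
# float64, while the equivalent sum of logs stays finite.
import numpy as np

probs = np.full(1000, 0.1)  # 1000 likelihood factors of 0.1 each

print(np.prod(probs))        # 0.0: 1e-1000 is far below the float64 range
print(np.sum(np.log(probs))) # finite: 1000 * log(0.1), about -2302.6
```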
Because the arg max does not change when we rescale the cost function, we can divide by $m$ to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution $\hat p_{\mathrm{data}}$ defined by the training data:

$$\theta_{\mathrm{ML}} = \arg\max_\theta \mathbb{E}_{x \sim \hat p_{\mathrm{data}}} \log p_{\mathrm{model}}(x; \theta). \tag{5.59}$$

One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution $\hat p_{\mathrm{data}}$ defined by the training set and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by

$$D_{\mathrm{KL}}(\hat p_{\mathrm{data}} \,\|\, p_{\mathrm{model}}) = \mathbb{E}_{x \sim \hat p_{\mathrm{data}}}\left[\log \hat p_{\mathrm{data}}(x) - \log p_{\mathrm{model}}(x)\right]. \tag{5.60}$$

The term on the left is a function only of the data generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize

$$-\mathbb{E}_{x \sim \hat p_{\mathrm{data}}}\left[\log p_{\mathrm{model}}(x)\right], \tag{5.61}$$

which is of course the same as the maximization in Eq. 5.59.

Minimizing this KL divergence corresponds exactly to minimizing the cross-entropy between the distributions. Many authors use the term "cross-entropy" to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-entropy between the empirical distribution defined by the training set and the model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.

We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution $\hat p_{\mathrm{data}}$. Ideally, we would like to match the true data generating distribution $p_{\mathrm{data}}$, but we have no direct access to this distribution.

While the optimal $\theta$ is the same regardless of whether we are maximizing the
likelihood or minimizing the KL divergence, the values of the objective functions are different. In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross entropy. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when $x$ is real-valued.
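The last point can be seen with a one-line computation: a Gaussian density with small $\sigma$ exceeds 1 at its mode, so the NLL there is negative. A sketch (the value of $\sigma$ is an arbitrary illustrative choice):

```python
# Sketch: for real-valued x, a density can exceed 1, so the negative
# log-likelihood of a single point can itself be negative.
import numpy as np

sigma = 0.01
x = 0.0  # a point at the mode of N(0, sigma^2)

# log of the Gaussian density N(x; 0, sigma^2)
log_density = -np.log(sigma) - 0.5 * np.log(2 * np.pi) - x**2 / (2 * sigma**2)
nll = -log_density
print(nll)  # negative, since the density at the mode is about 39.9 > 1
```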
5.5.1
Conditional Log-Likelihood and Mean Squared Error
The maximum likelihood estimator can readily be generalized to the case where our goal is to estimate a conditional probability $P(y \mid x; \theta)$ in order to predict $y$ given $x$. This is actually the most common situation because it forms the basis for most supervised learning. If $X$ represents all our inputs and $Y$ all our observed targets, then the conditional maximum likelihood estimator is

$$\theta_{\mathrm{ML}} = \arg\max_\theta P(Y \mid X; \theta). \tag{5.62}$$

If the examples are assumed to be i.i.d., then this can be decomposed into

$$\theta_{\mathrm{ML}} = \arg\max_\theta \sum_{i=1}^{m} \log P\left(y^{(i)} \mid x^{(i)}; \theta\right). \tag{5.63}$$
Linear regression, introduced earlier in Sec. 5.1.4, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input $x$ and produce an output value $\hat y$. The mapping from $x$ to $\hat y$ is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction $\hat y$, we now think of the model as producing a conditional distribution $p(y \mid x)$. We can imagine that with an infinitely large training set, we might see several training examples with the same input value $x$ but different values of $y$. The goal of the learning algorithm is now to fit the distribution $p(y \mid x)$ to all of those different $y$ values that are all compatible with $x$.

To derive the same linear regression algorithm we obtained before, we define $p(y \mid x) = \mathcal{N}(y; \hat y(x; w), \sigma^2)$. The function $\hat y(x; w)$ gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant $\sigma^2$ chosen by the user. We will see that this choice of the functional form of $p(y \mid x)$ causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood (Eq. 5.63) is given by

$$\sum_{i=1}^{m} \log p\left(y^{(i)} \mid x^{(i)}; \theta\right) \tag{5.64}$$

$$= -m \log \sigma - \frac{m}{2}\log(2\pi) - \sum_{i=1}^{m} \frac{\left\|\hat y^{(i)} - y^{(i)}\right\|^2}{2\sigma^2}, \tag{5.65}$$
where $\hat y^{(i)}$ is the output of the linear regression on the $i$-th input $x^{(i)}$ and $m$ is the number of the training examples. Comparing the log-likelihood with the mean squared error,

$$\mathrm{MSE}_{\mathrm{train}} = \frac{1}{m}\sum_{i=1}^{m}\left\|\hat y^{(i)} - y^{(i)}\right\|^2, \tag{5.66}$$

we immediately see that maximizing the log-likelihood with respect to $w$ yields the same estimate of the parameters $w$ as does minimizing the mean squared error. The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
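This equivalence can be checked numerically: with $\sigma$ fixed, Eq. 5.65 is a decreasing affine function of the training MSE, so both criteria select the same $w$. A sketch on synthetic data (the one-parameter model, grid, and seed are arbitrary illustrative choices):

```python
# Sketch checking Eqs. 5.64-5.66: over a grid of candidate slopes w,
# the w minimizing MSE also maximizes the Gaussian log-likelihood.
import numpy as np

rng = np.random.default_rng(4)
m, sigma = 200, 1.0
x = rng.normal(size=m)
y = 3.0 * x + rng.normal(scale=sigma, size=m)  # true slope is 3.0

def mse(w):
    return np.mean((w * x - y) ** 2)

def log_likelihood(w):  # Eq. 5.65 with fixed sigma
    return (-m * np.log(sigma) - 0.5 * m * np.log(2 * np.pi)
            - np.sum((w * x - y) ** 2) / (2 * sigma ** 2))

ws = np.linspace(0.0, 6.0, 601)
w_mse = ws[np.argmin([mse(w) for w in ws])]
w_mle = ws[np.argmax([log_likelihood(w) for w in ws])]
print(w_mse, w_mle)  # same location of the optimum
```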
5.5.2
Properties of Maximum Likelihood
The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples $m \to \infty$, in terms of its rate of convergence as $m$ increases.

Under appropriate conditions, the maximum likelihood estimator has the property of consistency (see Sec. 5.4.5 above), meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are:

• The true distribution $p_{\mathrm{data}}$ must lie within the model family $p_{\mathrm{model}}(\cdot; \theta)$. Otherwise, no estimator can recover $p_{\mathrm{data}}$.

• The true distribution $p_{\mathrm{data}}$ must correspond to exactly one value of $\theta$. Otherwise, maximum likelihood can recover the correct $p_{\mathrm{data}}$, but will not be able to determine which value of $\theta$ was used by the data generating process.
There are other inductive principles besides the maximum likelihood estimator, many of which share the property of being consistent estimators. However, consistent estimators can differ in their statistic efficiency, meaning that one consistent estimator may obtain lower generalization error for a fixed number of samples $m$, or equivalently, may require fewer examples to obtain a fixed level of generalization error.

Statistical efficiency is typically studied in the parametric case (like in linear regression) where our goal is to estimate the value of a parameter (and assuming it is possible to identify the true parameter), not the value of a function. A way to measure how close we are to the true parameter is by the expected mean squared error, computing the squared difference between the estimated and true parameter
values, where the expectation is over $m$ training samples from the data generating distribution. That parametric mean squared error decreases as $m$ increases, and for $m$ large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower mean squared error than the maximum likelihood estimator.

For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.
5.6 Bayesian Statistics
So far we have discussed frequentist statistics and approaches based on estimating a single value of θ, then making all predictions thereafter based on that one estimate. Another approach is to consider all possible values of θ when making a prediction. The latter is the domain of Bayesian statistics.

As discussed in Sec. 5.4.1, the frequentist perspective is that the true parameter value θ is fixed but unknown, while the point estimate θ̂ is a random variable on account of it being a function of the dataset (which is seen as random).

The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty of states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter θ is unknown or uncertain and thus is represented as a random variable.
Before observing the data, we represent our knowledge of θ using the prior probability distribution, p(θ) (sometimes referred to as simply "the prior"). Generally, the machine learning practitioner selects a prior distribution that is quite broad (i.e. with high entropy) to reflect a high degree of uncertainty in the value of θ before observing any data. For example, one might assume a priori that θ lies in some finite range or volume, with a uniform distribution. Many priors instead reflect a preference for "simpler" solutions (such as smaller magnitude coefficients, or a function that is closer to being constant).

Now consider that we have a set of data samples {x^{(1)}, …, x^{(m)}}. We can recover the effect of data on our belief about θ by combining the data likelihood p(x^{(1)}, …, x^{(m)} | θ) with the prior via Bayes' rule:

p(θ | x^{(1)}, …, x^{(m)}) = p(x^{(1)}, …, x^{(m)} | θ) p(θ) / p(x^{(1)}, …, x^{(m)})    (5.67)
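The update in Eq. 5.67 can be made concrete with a small numerical sketch. The setup below is purely illustrative (the Bernoulli likelihood, the coin-flip data, and the discretized grid over θ are assumptions, not part of the text): a broad uniform prior is combined with the likelihood of the observed samples to produce a normalized posterior.

```python
import numpy as np

# Hypothetical example: infer a Bernoulli parameter theta from coin-flip data.
theta = np.linspace(0.001, 0.999, 999)      # discretized grid over theta
prior = np.ones_like(theta)                 # broad (uniform) prior, high entropy
prior /= prior.sum()

data = np.array([1, 1, 0, 1, 1, 1, 0, 1])   # observed samples x^(1..m)
k, m = data.sum(), len(data)

# Likelihood p(x^(1..m) | theta) for each grid value of theta.
likelihood = theta**k * (1 - theta)**(m - k)

# Bayes' rule (Eq. 5.67): posterior is proportional to likelihood * prior.
posterior = likelihood * prior
posterior /= posterior.sum()                # normalize by the evidence

print(theta[np.argmax(posterior)])          # posterior concentrates near k/m
```

Because the evidence p(x^{(1)}, …, x^{(m)}) does not depend on θ, it suffices to normalize the product of likelihood and prior over the grid.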
In the scenarios where Bayesian estimation is typically used, the prior begins as a relatively uniform or Gaussian distribution with high entropy, and the observation of the data usually causes the posterior to lose entropy and concentrate around a few highly likely values of the parameters.

Relative to maximum likelihood estimation, Bayesian estimation offers two important differences. First, unlike the maximum likelihood approach that makes predictions using a point estimate of θ, the Bayesian approach is to make predictions using a full distribution over θ. For example, after observing m examples, the predicted distribution over the next data sample, x^{(m+1)}, is given by

p(x^{(m+1)} | x^{(1)}, …, x^{(m)}) = ∫ p(x^{(m+1)} | θ) p(θ | x^{(1)}, …, x^{(m)}) dθ.    (5.68)

Here each value of θ with positive probability density contributes to the prediction of the next example, with the contribution weighted by the posterior density itself. After having observed {x^{(1)}, …, x^{(m)}}, if we are still quite uncertain about the value of θ, then this uncertainty is incorporated directly into any predictions we might make.

In Sec. 5.4, we discussed how the frequentist approach addresses the uncertainty in a given point estimate of θ by evaluating its variance. The variance of the estimator is an assessment of how the estimate might change with alternative samplings of the observed data. The Bayesian answer to the question of how to deal with the uncertainty in the estimator is to simply integrate over it, which tends to protect well against overfitting.
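On a discretized parameter space, the integral in Eq. 5.68 becomes a posterior-weighted sum. The following sketch uses a hypothetical Bernoulli-with-grid setup (none of the specifics come from the text) to approximate the predictive probability that the next sample equals 1:

```python
import numpy as np

# Hypothetical Bernoulli setup on a discretized grid over theta.
theta = np.linspace(0.001, 0.999, 999)
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])
k, m = data.sum(), len(data)
posterior = theta**k * (1 - theta)**(m - k)   # uniform prior absorbed
posterior /= posterior.sum()

# Eq. 5.68: p(x^(m+1)=1 | data) = integral of p(x=1 | theta) p(theta | data),
# approximated here by a sum over the grid. Every theta with positive
# posterior density contributes, weighted by that density.
p_next_is_one = np.sum(theta * posterior)
print(p_next_is_one)   # close to (k+1)/(m+2) = 0.7 for this uniform prior
```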
This integral is of course just an application of the laws of probability, making the Bayesian approach simple to justify, while the frequentist machinery for constructing an estimator is based on the rather ad hoc decision to summarize all knowledge contained in the dataset with a single point estimate.

The second important difference between the Bayesian approach to estimation and the maximum likelihood approach is due to the contribution of the Bayesian prior distribution. The prior has an influence by shifting probability mass density towards regions of the parameter space that are preferred a priori. In practice, the prior often expresses a preference for models that are simpler or more smooth.
Critics of the Bayesian approach identify the prior as a source of subjective human judgment impacting the predictions.

Bayesian methods typically generalize much better when limited training data is available, but typically suffer from high computational cost when the number of training examples is large.
Here we consider the Bayesian estimation approach to learning the linear regression parameters. In linear regression, we learn a linear mapping from an input vector x ∈ R^n to predict the value of a scalar y ∈ R. The prediction is parametrized by the vector w ∈ R^n:

ŷ = w^⊤ x.    (5.69)

Given a set of m training samples (X^{(train)}, y^{(train)}), we can express the prediction of y over the entire training set as:

ŷ^{(train)} = X^{(train)} w.    (5.70)

Expressed as a Gaussian conditional distribution on y^{(train)}, we have
p(y^{(train)} | X^{(train)}, w) = N(y^{(train)}; X^{(train)} w, I)    (5.71)
∝ exp(−(1/2)(y^{(train)} − X^{(train)} w)^⊤ (y^{(train)} − X^{(train)} w)),    (5.72)

where we follow the standard MSE formulation in assuming that the Gaussian variance on y is one. In what follows, to reduce the notational burden, we refer to (X^{(train)}, y^{(train)}) as simply (X, y).

To determine the posterior distribution over the model parameter vector w, we first need to specify a prior distribution. The prior should reflect our naive belief about the value of these parameters. While it is sometimes difficult or unnatural to express our prior beliefs in terms of the parameters of the model, in practice we typically assume a fairly broad distribution expressing a high degree of uncertainty about θ.
For real-valued parameters it is common to use a Gaussian as a prior distribution:

p(w) = N(w; μ_0, Λ_0) ∝ exp(−(1/2)(w − μ_0)^⊤ Λ_0^{−1} (w − μ_0)),    (5.73)

where μ_0 and Λ_0 are the prior distribution mean vector and covariance matrix respectively.¹

With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters.

p(w | X, y) ∝ p(y | X, w) p(w)    (5.74)

¹ Unless there is a reason to assume a particular covariance structure, we typically assume a diagonal covariance matrix.
∝ exp(−(1/2)(y − Xw)^⊤ (y − Xw)) exp(−(1/2)(w − μ_0)^⊤ Λ_0^{−1} (w − μ_0))    (5.75)
∝ exp(−(1/2)(−2 y^⊤ X w + w^⊤ X^⊤ X w + w^⊤ Λ_0^{−1} w − 2 μ_0^⊤ Λ_0^{−1} w)).    (5.76)

We now define Λ_m = (X^⊤ X + Λ_0^{−1})^{−1} and μ_m = Λ_m (X^⊤ y + Λ_0^{−1} μ_0). Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:

p(w | X, y) ∝ exp(−(1/2)(w − μ_m)^⊤ Λ_m^{−1} (w − μ_m) + (1/2) μ_m^⊤ Λ_m^{−1} μ_m)    (5.77)
∝ exp(−(1/2)(w − μ_m)^⊤ Λ_m^{−1} (w − μ_m)).    (5.78)

All terms that do not include the parameter vector w have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1. Eq. 3.23 shows how to normalize a multivariate Gaussian distribution.

Examining this posterior distribution allows us to gain some intuition for the effect of Bayesian inference. In most situations, we set μ_0 to 0. If we set Λ_0 = (1/α) I, then μ_m gives the same estimate of w as does frequentist linear regression with
a weight decay penalty of α w^⊤ w. One difference is that the Bayesian estimate is undefined if α is set to zero: we are not allowed to begin the Bayesian learning process with an infinitely wide prior on w. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of w are, rather than providing only the estimate μ_m.
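This correspondence between μ_m and weight decay can be checked numerically. The sketch below uses synthetic data (the dataset, the value of α, and the random seed are illustrative assumptions) to compute the posterior parameters Λ_m and μ_m defined after Eq. 5.76 and to compare μ_m against the frequentist ridge solution (X^⊤ X + α I)^{−1} X^⊤ y:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data for a small linear regression problem.
m, n = 50, 3
X = rng.normal(size=(m, n))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(size=m)        # unit Gaussian noise, as in Eq. 5.71

# Prior (Eq. 5.73) with mu_0 = 0 and Lambda_0 = (1/alpha) I.
alpha = 0.1
mu0 = np.zeros(n)
Lambda0_inv = alpha * np.eye(n)

# Posterior parameters:
#   Lambda_m = (X^T X + Lambda_0^{-1})^{-1}
#   mu_m     = Lambda_m (X^T y + Lambda_0^{-1} mu_0)
Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)
mu_m = Lambda_m @ (X.T @ y + Lambda0_inv @ mu0)

# Frequentist linear regression with a weight decay penalty gives the same
# point estimate...
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)
assert np.allclose(mu_m, w_ridge)

# ...but unlike ridge, the Bayesian treatment also returns a covariance Lambda_m.
print(mu_m, np.diag(Lambda_m))
```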
5.6.1 Maximum A Posteriori (MAP) Estimation
While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter θ, it is still often desirable to have a single point estimate. One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate. The MAP estimate chooses the point of maximal
posterior probability (or maximal probability density in the more common case of continuous θ):

θ_MAP = arg max_θ p(θ | x) = arg max_θ [log p(x | θ) + log p(θ)].    (5.79)

We recognize, above on the right hand side, log p(x | θ), i.e. the standard log-likelihood term, and log p(θ), corresponding to the prior distribution.

As an example, consider a linear regression model with a Gaussian prior on the weights w. If this prior is given by N(w; 0, (1/λ) I), then the log-prior term in Eq. 5.79 is proportional to the familiar λ w^⊤ w weight decay penalty, plus a term that does not depend on w and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.

As with full Bayesian inference, MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data. This additional information helps to reduce the variance in the
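Eq. 5.79 can be illustrated on the same kind of hypothetical discretized Bernoulli problem used earlier (the data and the Beta-shaped prior are assumptions for the sketch, not from the text): adding the log-prior shifts the maximizing point away from the maximum likelihood estimate and toward values the prior favors.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)      # discretized grid over theta
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])
k, m = data.sum(), len(data)

# Log-likelihood log p(x | theta), and a Beta(3, 3)-shaped log-prior
# log p(theta) (a hypothetical choice favoring theta near 0.5).
log_lik = k * np.log(theta) + (m - k) * np.log(1 - theta)
log_prior = 2 * np.log(theta) + 2 * np.log(1 - theta)

theta_ml = theta[np.argmax(log_lik)]                # maximum likelihood: k/m
theta_map = theta[np.argmax(log_lik + log_prior)]   # Eq. 5.79

print(theta_ml, theta_map)   # MAP lies between the ML estimate and 0.5
```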
MAP point estimate (in comparison to the ML estimate). However, it does so at the price of increased bias.

Many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference. This view applies when the regularization consists of adding an extra term to the objective function that corresponds to log p(θ). Not all regularization penalties correspond to MAP Bayesian inference. For example, some regularizer terms may not be the logarithm of a probability distribution. Other regularization terms depend on the data, which of course a prior probability distribution is not allowed to do.

MAP Bayesian inference provides a straightforward way to design complicated yet interpretable regularization terms. For example, a more complicated penalty
term can be derived by using a mixture of Gaussians, rather than a single Gaussian distribution, as the prior (Nowlan and Hinton, 1992).
5.7 Supervised Learning Algorithms
Recall from Sec. 5.1.3 that supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs x and outputs y. In many cases the outputs y may be difficult to collect automatically and must be provided by a human "supervisor," but the term still applies even when the training set targets were collected automatically.
5.7.1 Probabilistic Supervised Learning
Most supervised learning algorithms in this book are based on estimating a probability distribution p(y | x). We can do this simply by using maximum likelihood estimation to find the best parameter vector θ for a parametric family of distributions p(y | x; θ).

We have already seen that linear regression corresponds to the family

p(y | x; θ) = N(y; θ^⊤ x, I).    (5.80)

We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.

The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because
its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval (0, 1) and interpret that value as a probability:

p(y = 1 | x; θ) = σ(θ^⊤ x).    (5.81)

This approach is known as logistic regression (a somewhat strange name since we use the model for classification rather than regression).

In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult. There is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood (NLL) using gradient descent.

This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.
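A minimal sketch of this procedure, on synthetic linearly separable data (the dataset, learning rate, and iteration count are illustrative assumptions): the weights of the model in Eq. 5.81 are found by batch gradient descent on the mean negative log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary classification data, separable by a line through the origin.
m, n = 200, 2
X = rng.normal(size=(m, n))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimize the negative log-likelihood of p(y=1 | x; theta) = sigmoid(theta^T x)
# by batch gradient descent; no closed-form solution exists.
theta = np.zeros(n)
lr = 0.1
for _ in range(1000):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / m       # gradient of the mean NLL
    theta -= lr * grad

preds = (sigmoid(X @ theta) > 0.5).astype(float)
print((preds == y).mean())         # high training accuracy on this separable data
```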
5.7.2 Support Vector Machines
One of the most influential approaches to supervised learning is the support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to logistic regression in that it is driven by a linear function w^⊤ x + b. Unlike logistic
regression, the support vector machine does not provide probabilities, but only outputs a class identity. The SVM predicts that the positive class is present when w^⊤ x + b is positive. Likewise, it predicts that the negative class is present when w^⊤ x + b is negative.

One key innovation associated with support vector machines is the kernel trick. The kernel trick consists of observing that many machine learning algorithms can be written exclusively in terms of dot products between examples. For example, it can be shown that the linear function used by the support vector machine can be re-written as

w^⊤ x + b = b + Σ_{i=1}^{m} α_i x^⊤ x^{(i)}    (5.82)
where x^{(i)} is a training example and α is a vector of coefficients. Rewriting the learning algorithm this way allows us to replace x by the output of a given feature function φ(x) and the dot product with a function k(x, x^{(i)}) = φ(x) · φ(x^{(i)}) called a kernel. The · operator represents an inner product analogous to φ(x)^⊤ φ(x^{(i)}). For some feature spaces, we may not use literally the vector inner product. In some infinite dimensional spaces, we need to use other kinds of inner products, for example, inner products based on integration rather than summation. A complete development of these kinds of inner products is beyond the scope of this book.

After replacing dot products with kernel evaluations, we can make predictions using the function

f(x) = b + Σ_i α_i k(x, x^{(i)}).    (5.83)
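Eq. 5.83 can be sketched directly in code. Everything below is a toy setup (the training points, the coefficients α, the bias b, and the kernel bandwidth are assumptions): predictions are formed as a kernel-weighted sum, and the final check confirms that f is linear in α even though it is nonlinear in x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set and coefficients for a kernelized predictor.
m, n = 5, 2
X_train = rng.normal(size=(m, n))
alpha = rng.normal(size=m)        # coefficients alpha (arbitrary for this demo)
b = 0.3

def gaussian_kernel(u, v, sigma=1.0):
    # Gaussian (RBF) kernel, proportional to N(u - v; 0, sigma^2 I);
    # the normalization constant is dropped here.
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma**2))

def f(x):
    # Eq. 5.83: f(x) = b + sum_i alpha_i k(x, x^(i)).
    return b + sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, X_train))

# f is nonlinear in x, but linear in the coefficient vector alpha:
x = rng.normal(size=n)
k_vec = np.array([gaussian_kernel(x, xi) for xi in X_train])
assert np.isclose(f(x), b + alpha @ k_vec)
```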
The This function is nonlinear with resp ect to , but thecessing relationship betw x φ( x) kernel-based function is exactly equiv equivalent alent to prepro preprocessing the data by een applying (xall and ) isinputs, linear.then Also, the relationship between α and f (x) is linear. The φ (x) fto learning a linear Xmodel in the new transformed space. kernel-based function is exactly equivalent to preprocessing the data by applying The kernel tric trick k is pow owerful erful for tw two o reasons. First, it allows us to learn mo models dels φ(x) to all inputs, then learning a linear model in the new transformed space. that are nonlinear as a function of x using conv convex ex optimization techniques that are Theteed kernel trickerge is pefficiently owerful for two reasons. it allows us to learn moand dels guaran guaranteed to conv converge efficiently. . This is possibleFirst, because we consider φ fixed that are nonlinear as athe function of x using convex optimization that are optimize only α, i.e., optimization algorithm can view thetechniques decision function guaran teed to conv erge efficiently . This is p ossible b ecause w e consider fixed and φ as being linear in a different space. Second, the kernel function k often admits α, i.e., optimize onlytation theisoptimization can view the decision an implemen that significantly algorithm more computational efficien naiv implementation efficient t thanfunction naively ely k as b eing linear in a different space. Second, the k ernel function often admits constructing two φ(x) vectors and explicitly taking their dot pro product. duct. an implementation that is significantly more computational efficient than naively In some cases, can even e infinite taking dimensional, whic which h duct. would result in constructing two φ(φx(x ) )vectors andbexplicitly their dot pro an infinite computational cost for the naiv naive, e, explicit approach. 
In many cases, k(x, x′) is a nonlinear, tractable function of x even when φ(x) is intractable.
CHAPTER 5. MACHINE LEARNING BASICS
As an example of an infinite-dimensional feature space with a tractable kernel, we construct a feature mapping φ(x) over the non-negative integers x. Suppose that this mapping returns a vector containing x ones followed by infinitely many zeros. We can write a kernel function k(x, x^(i)) = min(x, x^(i)) that is exactly equivalent to the corresponding infinite-dimensional dot product.

The most commonly used kernel is the Gaussian kernel

k(u, v) = N(u − v; 0, σ^2 I),
(5.84)
where N(x; µ, Σ) is the standard normal density. This kernel is also known as the radial basis function (RBF) kernel, because its value decreases along lines in v space radiating outward from u. The Gaussian kernel corresponds to a dot product in an infinite-dimensional space, but the derivation of this space is less straightforward than in our example of the min kernel over the integers.

We can think of the Gaussian kernel as performing a kind of template matching. A training example x associated with training label y becomes a template for class y. When a test point x′ is near x according to Euclidean distance, the Gaussian kernel has a large response, indicating that x′ is very similar to the x template. The model then puts a large weight on the associated training label y. Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
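A minimal sketch of the Gaussian kernel of Eq. 5.84 (the width σ and the test vectors are illustrative choices), showing the template-matching behavior of a response that decays with Euclidean distance:

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = N(u - v; 0, sigma^2 I): an isotropic Gaussian density
    evaluated at the difference vector u - v."""
    d = u - v
    n = d.shape[0]
    norm = (2.0 * np.pi * sigma**2) ** (-n / 2.0)
    return norm * np.exp(-np.dot(d, d) / (2.0 * sigma**2))

u = np.array([0.0, 0.0])
# The response decays with Euclidean distance from the template u,
# which is what makes the kernel behave like template matching.
near = gaussian_kernel(u, np.array([0.1, 0.0]))
far = gaussian_kernel(u, np.array([3.0, 0.0]))
print(near > far)  # True
```

Many presentations drop the normalizing constant and use exp(−‖u − v‖² / (2σ²)) directly; the two differ only by a fixed positive factor.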
Support vector machines are not the only algorithm that can be enhanced using the kernel trick. Many other linear models can be enhanced in this way. The category of algorithms that employ the kernel trick is known as kernel machines or kernel methods (Williams and Rasmussen, 1996; Schölkopf et al., 1999).

A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the i-th example contributes a term α_i k(x, x^(i)) to the decision function. Support vector machines are able to mitigate this by learning an α vector that contains mostly zeros. Classifying a new example then requires evaluating the kernel function only for the training examples that have non-zero α_i. These training examples are known as support vectors.
Kernel machines also suffer from a high computational cost of training when the dataset is large. We will revisit this idea in Sec. 5.9. Kernel machines with generic kernels struggle to generalize well. We will explain why in Sec. 5.11. The modern incarnation of deep learning was designed to overcome these limitations of kernel machines. The current deep learning renaissance began when Hinton et al. (2006) demonstrated that a neural network could outperform the RBF kernel SVM on the MNIST benchmark.
5.7.3
Other Simple Supervised Learning Algorithms
We have already briefly encountered another non-probabilistic supervised learning algorithm, nearest neighbor regression. More generally, k-nearest neighbors is a family of techniques that can be used for classification or regression. As a non-parametric learning algorithm, k-nearest neighbors is not restricted to a fixed number of parameters. We usually think of the k-nearest neighbors algorithm as not having any parameters, but rather implementing a simple function of the training data. In fact, there is not even really a training stage or learning process. Instead, at test time, when we want to produce an output y for a new test input x, we find the k nearest neighbors to x in the training data X. We then return the average of the corresponding y values in the training set.
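The procedure just described can be sketched directly; the toy dataset and the choice k = 3 below are assumptions for illustration:

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3):
    """Return the average y value of the k training points nearest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]   # indices of the k closest examples
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

# The three nearest neighbors of x = 1.0 are 0.0, 1.0 and 2.0,
# so the prediction is the mean of their labels.
print(knn_predict(np.array([1.0]), X_train, y_train, k=3))  # 1.0
```

Note that all the work happens at prediction time; there is no fitting step, matching the "no training stage" observation above.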
This w orks for essentially the case of classification, we can av average erage ov over er one-hot code vectors c with cy = 1 any ckind offor supall ervised learning where we can define an average ver y ovvalues. In and = 0 other v alues of . W e can then in interpret terpret the avoerage er these i i c with c = 1 the caseco ofdes classification, we can average over one-hot code vectors one-hot codes as giving a probability distribution ov over er classes. As a non-parametric and c =algorithm, 0 for all other values of i. Wecan canac then terpret average ovexample, er these k-nearest learning neighbor achiev hiev hievee in very highthe capacit capacity y. For one-hot co des as giving a probability distribution ov er classes. As a non-parametric supp suppose ose we ha hav ve a multiclass classification task and measure performance with 0-1 learning algorithm, ac hieve to very high capacit or example, loss. In this setting,k-nearest 1-nearestneighbor neighborcan con conv verges double the Ba Bay yy.esFerror as the supp ose we ha v e a multiclass classification task and measure p erformance with num umb ber of training examples approac approaches hes infinit infinity y. The error in excess of the Ba Bay y0-1 es loss. In this setting, 1 -nearest neighbor con v erges to double the Ba y es error as the error results from cho hoosing osing a single neighbor by breaking ties betw etween een equally n um b er of training examples approac hes infinit y . The error in excess of Bayx es distan distantt neighbors randomly randomly.. When there is infinite training data, all testthe points ointsx errorha results from cman hoosing a single ties bIf etw een equally will hav ve infinitely many y training set neighbor neigh neighbors borsby at breaking distance zero. we allow the x distan t neighbors randomly . 
When there is infinite training data, all test points x will have infinitely many training set neighbors at distance zero. If we allow the algorithm to use all of these neighbors to vote, rather than randomly choosing one of them, the procedure converges to the Bayes error rate. The high capacity of k-nearest neighbors allows it to obtain high accuracy given a large training set. However, it does so at high computational cost, and it may generalize very badly given a small, finite training set. One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another. For example, imagine we have a regression task with x ∈ R^100 drawn from an isotropic Gaussian distribution, but only a single variable x_1 is relevant to the output. Suppose further that this feature simply encodes the output directly, i.e. that y = x_1 in all cases.
Nearest neighbor regression will not be able to detect this simple pattern. The nearest neighbor of most points x will be determined by the large number of features x_2 through x_100, not by the lone feature x_1. Thus the output on small training sets will essentially be random.
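This failure mode is easy to reproduce numerically. The dimensions, sample sizes, and seed below are arbitrary choices for a rough sketch, not a careful experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 100                    # small training set, many dimensions
X_train = rng.standard_normal((n, d))
y_train = X_train[:, 0]            # only the first feature matters: y = x_1

def one_nn(x):
    """1-nearest-neighbor regression."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

X_test = rng.standard_normal((50, d))
pred = np.array([one_nn(x) for x in X_test])
mse = np.mean((pred - X_test[:, 0]) ** 2)

# Distances are dominated by the 99 irrelevant features, so the error is
# far from zero -- close to what a randomly chosen training label gives.
print(mse)
```

A method that can weight or select features (e.g. a linear model) would drive this error toward zero on the same data.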
[Figure 5.7: diagram of a decision tree and the regions into which it divides the input space.]
Another type of learning algorithm that also breaks the input space into regions and has separate parameters for each region is the decision tree (Breiman et al., 1984) and its many variants. As shown in Fig. 5.7, each node of the decision tree is associated with a region in the input space, and internal nodes break that region into one sub-region for each child of the node (typically using an axis-aligned cut). Space is thus sub-divided into non-overlapping regions, with a one-to-one correspondence between leaf nodes and input regions. Each leaf node usually maps every point in its input region to the same output. Decision trees are usually trained with specialized algorithms that are beyond the scope of this book.
The learning algorithm can be considered non-parametric if it is allowed to learn a tree of arbitrary size, though decision trees are usually regularized with size constraints that turn them into parametric models in practice. Decision trees as they are typically used, with axis-aligned splits and constant outputs within each node, struggle to solve some problems that are easy even for logistic regression. For example, if we have a two-class problem and the positive class occurs wherever x_2 > x_1, the decision boundary is not axis-aligned. The decision tree will thus need to approximate the decision boundary with many nodes, implementing a step function that constantly walks back and forth across the true decision function with axis-aligned steps.

As we have seen, nearest neighbor predictors and decision trees have many limitations.
Nonetheless, they are useful learning algorithms when computational resources are constrained. We can also build intuition for more sophisticated learning algorithms by thinking about the similarities and differences between sophisticated algorithms and k-NN or decision tree baselines.

See Murphy (2012), Bishop (2006), Hastie et al. (2001) or other machine learning textbooks for more material on traditional supervised learning algorithms.
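The axis-aligned weakness of decision trees described above can be illustrated with a brute-force sketch. Restricting attention to single-split "stumps" is a deliberate simplification of a full tree, and the dataset is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = (X[:, 1] > X[:, 0]).astype(int)   # positive class wherever x2 > x1

def best_stump_accuracy(X, y):
    """Best accuracy of any single axis-aligned split (a depth-1 tree)."""
    best = 0.0
    for j in range(X.shape[1]):             # feature to split on
        for t in X[:, j]:                   # candidate thresholds
            for sign in (1, -1):            # which side predicts class 1
                pred = (sign * (X[:, j] - t) > 0).astype(int)
                best = max(best, float(np.mean(pred == y)))
    return best

acc = best_stump_accuracy(X, y)
# No single axis-aligned cut can match the oblique boundary x2 = x1;
# a full tree must stack many such cuts in a staircase pattern.
print(acc < 1.0)  # True
```

A logistic regression with features (x_1, x_2) separates this problem exactly with one oblique line, which is the contrast the text draws.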
5.8
Unsupervised Learning Algorithms
Recall from Sec. 5.1.3 that unsupervised algorithms are those that experience only "features" but not a supervision signal. The distinction between supervised and unsupervised algorithms is not formally and rigidly defined because there is no objective test for distinguishing whether a value is a feature or a target provided by a supervisor. Informally, unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate examples. The term is usually associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding a manifold that the data lies near, or clustering the data into groups of related examples.
A classic unsupervised learning task is to find the "best" representation of the data. By "best" we can mean different things, but generally speaking we are looking for a representation that preserves as much information about x as possible while obeying some penalty or constraint aimed at keeping the representation simpler or more accessible than x itself.

There are multiple ways of defining a simpler representation. Three of the most common include lower dimensional representations, sparse representations and independent representations. Low-dimensional representations attempt to compress as much information about x as possible in a smaller representation. Sparse representations (Barlow, 1989; Olshausen and Field, 1996; Hinton and Ghahramani, 1997) embed the dataset into a representation whose entries are mostly zeroes for most inputs.
The use of sparse representations typically requires increasing the dimensionality of the representation, so that the representation becoming mostly zeroes does not discard too much information. This results in an overall structure of the representation that tends to distribute data along the axes of the representation space. Independent representations attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent.

Of course these three criteria are certainly not mutually exclusive. Low-dimensional representations often yield elements that have fewer or weaker dependencies than the original high-dimensional data. This is because one way to reduce the size of a representation is to find and remove redundancies.
Identifying ac achiev hiev hievee more compression while discarding less information. and removing more redundancy allows the dimensionality reduction algorithm to The notioncompression of representation is one of the central themes of deep learning and achiev e more while discarding less information. therefore one of the central themes in this book. In this section, we dev develop elop some The notion of representation is one of the central themes of deep learning and simple examples of represen representation tation learning algorithms. Together, these example therefore one of the central themes in this b o ok. In this section, w e dev elop some algorithms show how to op operationalize erationalize all three of the criteria ab abov ov ove. e. Most of the simple examples of represen tation learning algorithms. T ogether,algorithms these example remaining chapters in intro tro troduce duce additional representation learning that algorithms show how to op erationalize all three of the criteria ab ov e. Most of the dev develop elop these criteria in different ways or in intro tro troduce duce other criteria. remaining chapters introduce additional representation learning algorithms that develop these criteria in different ways or introduce other criteria.
5.8.1
Principal Components Analysis
In Sec. 2.12, we saw that the principal components analysis algorithm provides a means of compressing data. We can also view PCA as an unsupervised learning algorithm that learns a representation of data. This representation is based on two of the criteria for a simple representation described above.
[Figure 5.8: illustration of the PCA projection z = x⊤W of the data x.]
PCA learns a representation that has lower dimensionality than the original input. It also learns a representation whose elements have no linear correlation with each other. This is a first step toward the criterion of learning representations whose elements are statistically independent. To achieve full independence, a representation learning algorithm must also remove the nonlinear relationships between variables.

PCA learns an orthogonal, linear transformation of the data that projects an input x to a representation z, as shown in Fig. 5.8. In Sec. 2.12, we saw that we could learn a one-dimensional representation that best reconstructs the original data (in the sense of mean squared error) and that this representation actually corresponds to the first principal component of the data.
Thus we can use PCA as a simple and effective dimensionality reduction method that preserves as much of the information in the data as possible (again, as measured by least-squares reconstruction error). In the following, we will study how the PCA representation decorrelates the original data representation X.

Let us consider the m × n design matrix X. We will assume that the data has a mean of zero, E[x] = 0. If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.

The unbiased sample covariance matrix associated with X is given by:

Var[x] = (1 / (m − 1)) X⊤X.    (5.85)
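Eq. 5.85 is straightforward to check numerically; the random data below is an illustrative example, and `np.cov` serves as an independent reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.standard_normal((m, n))
X = X - X.mean(axis=0)          # center so that the sample mean is zero

cov = X.T @ X / (m - 1)         # Eq. 5.85: unbiased sample covariance

# np.cov uses the same (m - 1) normalization; rowvar=False treats
# rows as examples, matching the design-matrix convention.
print(np.allclose(cov, np.cov(X, rowvar=False)))  # True
```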
PCA finds a representation (through linear transformation) z = x⊤W where Var[z] is diagonal.

In Sec. 2.12, we saw that the principal components of a design matrix X are given by the eigenvectors of X⊤X. From this view,

X⊤X = W Λ W⊤.    (5.86)

In this section, we exploit an alternative derivation of the principal components. The principal components may also be obtained via the singular value decomposition. Specifically, they are the right singular vectors of X. To see this, let W be the right singular vectors in the decomposition X = U Σ W⊤. We then recover the original eigenvector equation with W as the eigenvector basis:

X⊤X = (U Σ W⊤)⊤ U Σ W⊤ = W Σ² W⊤.    (5.87)
The SVD is helpful to show that PCA results in a diagonal Var[z]. Using the SVD of X, we can express the variance of X as:

Var[x] = (1 / (m − 1)) X⊤X    (5.88)
       = (1 / (m − 1)) (U Σ W⊤)⊤ U Σ W⊤    (5.89)
       = (1 / (m − 1)) W Σ⊤ U⊤ U Σ W⊤    (5.90)
       = (1 / (m − 1)) W Σ² W⊤,    (5.91)

where we use the fact that U⊤U = I because the U matrix of the singular value decomposition is defined to be orthogonal. This shows that if we take z = x⊤W, we can ensure that the covariance of z is diagonal as required:

Var[z] = (1 / (m − 1)) Z⊤Z    (5.92)
       = (1 / (m − 1)) W⊤X⊤X W    (5.93)
       = (1 / (m − 1)) W⊤W Σ² W⊤W    (5.94)
       = (1 / (m − 1)) Σ²,    (5.95)

where this time we use the fact that W⊤W = I, again from the definition of the SVD.
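The derivation above can be verified numerically. This sketch, with arbitrary random data, confirms that the projected covariance Var[z] is diagonal and equals Σ² / (m − 1):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 4
A = rng.standard_normal((n, n))
X = rng.standard_normal((m, n)) @ A      # correlated columns
X = X - X.mean(axis=0)

U, S, Wt = np.linalg.svd(X, full_matrices=False)  # X = U diag(S) W^T
W = Wt.T                                          # right singular vectors

Z = X @ W                        # project each example: z = x^T W
var_z = Z.T @ Z / (m - 1)        # should equal diag(S^2) / (m - 1)

off_diag = var_z - np.diag(np.diag(var_z))
print(np.max(np.abs(off_diag)) < 1e-8)  # True: Var[z] is diagonal
```

Note that `np.linalg.svd` returns W⊤ (here `Wt`), not W, so the right singular vectors are its rows.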
The above analysis shows that when we project the data x to z, via the linear transformation W, the resulting representation has a diagonal covariance matrix (as given by Σ^2), which immediately implies that the individual elements of z are mutually uncorrelated.

This ability of PCA to transform data into a representation where the elements are mutually uncorrelated is a very important property of PCA. It is a simple example of a representation that attempts to disentangle the unknown factors of variation underlying the data. In the case of PCA, this disentangling takes the form of finding a rotation of the input space (described by W) that aligns the principal axes of variance with the basis of the new representation space associated with z.

While correlation is an important category of dependency between elements of the data, we are also interested in learning representations that disentangle more complicated forms of feature dependencies. For this, we will need more than what can be done with a simple linear transformation.
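The claim that Var[z] is diagonal can be checked numerically. Below is a minimal sketch with NumPy; the toy data and variable names are our own illustration, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500
# Correlated, centered design matrix X with one example per row.
X = rng.normal(size=(m, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
X = X - X.mean(axis=0)

# SVD: X = U diag(s) W^T, where the columns of W are the principal directions.
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T

Z = X @ W                   # z = x^T W for each example
cov_z = Z.T @ Z / (m - 1)   # covariance of the new representation

# Off-diagonal entries vanish: Var[z] = Sigma^2 / (m - 1) is diagonal.
print(np.allclose(cov_z, np.diag(s**2) / (m - 1)))
```

Because Z = XW = UΣ, the computed covariance matches Σ^2/(m − 1) up to floating-point error.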
5.8.2 k-means Clustering

Another example of a simple representation learning algorithm is k-means clustering.
The k-means clustering algorithm divides the training set into k different clusters of examples that are near each other. We can thus think of the algorithm as providing a k-dimensional one-hot code vector h representing an input x. If x belongs to cluster i, then h_i = 1 and all other entries of the representation h are zero.

The one-hot code provided by k-means clustering is an example of a sparse representation, because the majority of its entries are zero for every input. Later, we will develop other algorithms that learn more flexible sparse representations, where more than one entry can be non-zero for each input x. One-hot codes are an extreme example of sparse representations that lose many of the benefits of a distributed representation. The one-hot code still confers some statistical advantages (it naturally conveys the idea that all examples in the same cluster are similar to each other) and it confers the computational advantage that the entire representation may be captured by a single integer.

The k-means algorithm works by initializing k different centroids {µ^(1), . . . , µ^(k)} to different values, then alternating between two different steps until convergence. In one step, each training example is assigned to cluster i, where i is the index of the nearest centroid µ^(i). In the other step, each centroid µ^(i) is updated to the mean of all training examples x^(j) assigned to cluster i.
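The two alternating steps can be sketched directly in code. This is a minimal NumPy illustration of the algorithm as described; the initialization scheme, the stopping test, and the toy data are our own simplifications:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Alternate the assignment step and the centroid-update step."""
    rng = np.random.default_rng(seed)
    # Initialize the k centroids to k distinct training examples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each example joins its nearest centroid's cluster.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned examples.
        new_centroids = np.array([X[assign == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

# Two tight, well-separated blobs of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
centroids, assign = kmeans(X, k=2)
h = np.eye(2)[assign]   # the k-dimensional one-hot codes, one row per example
```

Each row of h has exactly one nonzero entry, so the whole representation of an example can equivalently be stored as the single integer assign[i].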
One difficulty pertaining to clustering is that the clustering problem is inherently ill-posed, in the sense that there is no single criterion that measures how well a clustering of the data corresponds to the real world. We can measure properties of the clustering such as the average Euclidean distance from a cluster centroid to the members of the cluster. This allows us to tell how well we are able to reconstruct the training data from the cluster assignments. We do not know how well the cluster assignments correspond to properties of the real world. Moreover, there may be many different clusterings that all correspond well to some property of the real world. We may hope to find a clustering that relates to one feature but obtain a different, equally valid clustering that is not relevant to our task. For example, suppose that we run two clustering algorithms on a dataset consisting of images of red trucks, images of red cars, images of gray trucks, and images of gray cars. If we ask each clustering algorithm to find two clusters, one algorithm may find a cluster of cars and a cluster of trucks, while another may find a cluster of red vehicles and a cluster of gray vehicles. Suppose we also run a third clustering algorithm, which is allowed to determine the number of clusters. This may assign the examples to four clusters: red cars, red trucks, gray cars, and gray trucks. This new clustering now at least captures information about both attributes, but it has lost information about similarity. Red cars are in a different cluster from gray cars, just as they are in a different cluster from gray trucks. The output of the clustering algorithm does not tell us that red cars are more similar to gray cars than they are to gray trucks. They are different from both things, and that is all we know.

These issues illustrate some of the reasons that we may prefer a distributed representation to a one-hot representation. A distributed representation could have two attributes for each vehicle: one representing its color and one representing whether it is a car or a truck. It is still not entirely clear what the optimal distributed representation is (how can the learning algorithm know whether the two attributes we are interested in are color and car-versus-truck rather than manufacturer and age?) but having many attributes reduces the burden on the algorithm to guess which single attribute we care about, and allows us to measure similarity between objects in a fine-grained way by comparing many attributes instead of just testing whether one attribute matches.
5.9 Stochastic Gradient Descent

Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent or SGD. Stochastic gradient descent is an extension of the gradient
descent algorithm introduced in Sec. 4.3.

A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.

The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as

    J(θ) = E_{x,y∼p̂_data} L(x, y, θ) = (1/m) Σ_{i=1}^m L(x^(i), y^(i), θ),    (5.96)

where L is the per-example loss L(x, y, θ) = −log p(y | x; θ).
For these additive cost functions, gradient descent requires computing

    ∇_θ J(θ) = (1/m) Σ_{i=1}^m ∇_θ L(x^(i), y^(i), θ).    (5.97)

The computational cost of this operation is O(m). As the training set size grows to billions of examples, the time to take a single gradient step becomes prohibitively long.

The insight of stochastic gradient descent is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples B = {x^(1), . . . , x^(m′)} drawn uniformly from the training set. The minibatch size m′ is typically chosen to be a relatively small number of examples, ranging from 1 to a few hundred. Crucially, m′ is usually held fixed as the training set size m
We may fit a training set with billions of examples using updates computed The estimate of the gradient is formed as on only a hundred examples. m0 The estimate of the gradient1 is formed as X g = 0 ∇θ L(x(i) , y(i) , θ). (5.98) m i=1 1 g= L(x , y , θ). (5.98) m B. The stochastic gradien using examples from the minibatch descen algorithm gradientt descentt ∇ then follo follows ws the estimated gradient B do downhill: wnhill: using examples from the minibatch . The stochastic gradient descent algorithm then follows the estimated gradientθ do wnhill: ← θ − g , (5.99) X
where is the learning rate. where is the learning rate.
θ
θ
← 151−
g ,
(5.99)
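Equations 5.96 through 5.99 can be sketched for a concrete per-example loss. The squared-error loss and the synthetic regression data below are our own illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10_000, 5
true_w = rng.normal(size=n)
X = rng.normal(size=(m, n))
y = X @ true_w + 0.01 * rng.normal(size=m)

w = np.zeros(n)    # the parameters theta
eps = 0.1          # learning rate epsilon
m_prime = 100      # minibatch size m', held fixed as m grows

for step in range(500):
    # Sample a minibatch B uniformly from the training set.
    idx = rng.choice(m, size=m_prime, replace=False)
    Xb, yb = X[idx], y[idx]
    # g = (1/m') * sum of per-example gradients of (1/2)(x^T w - y)^2.
    g = Xb.T @ (Xb @ w - yb) / m_prime
    w = w - eps * g    # theta <- theta - eps * g

print(np.allclose(w, true_w, atol=0.05))
```

Note that each update touches only m′ = 100 of the 10,000 examples, yet the parameters converge close to the generating weights.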
Gradient descent in general has often been regarded as slow or unreliable. In the past, the application of gradient descent to non-convex optimization problems was regarded as foolhardy or unprincipled. Today, we know that the machine learning models described in Part II work very well when trained with gradient descent. The optimization algorithm may not be guaranteed to arrive at even a local minimum in a reasonable amount of time, but it often finds a very low value of the cost function quickly enough to be useful.

Stochastic gradient descent has many important uses outside the context of deep learning. It is the main way to train large linear models on very large datasets. For a fixed model size, the cost per SGD update does not depend on the training set size m. In practice, we often use a larger model as the training set size increases, but we are not forced to do so. The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing m further will not extend the amount of training time needed to reach the model's best possible test error. From this point of view, one can argue that the asymptotic cost of training a model with SGD is O(1) as a function of m.

Prior to the advent of deep learning, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model. Many kernel learning algorithms require constructing an m × m matrix G_{i,j} = k(x^(i), x^(j)). Constructing this matrix has computational cost O(m^2), which is clearly undesirable for datasets with billions of examples. In academia, starting in 2006, deep learning was initially interesting because it was able to generalize to new examples better than competing algorithms when trained on medium-sized datasets with tens of thousands of examples. Soon after, deep learning garnered additional interest in industry, because it provided a scalable way of training nonlinear models on large datasets.

Stochastic gradient descent and many enhancements to it are described further in Chapter 8.
5.10 Building a Machine Learning Algorithm

Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.

For example, the linear regression algorithm combines a dataset consisting of
X and y, the cost function

    J(w, b) = −E_{x,y∼p̂_data} log p_model(y | x),    (5.100)

the model specification p_model(y | x) = N(y; x^⊤w + b, 1), and, in most cases, the optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations.

By realizing that we can replace any of these components mostly independently from the others, we can obtain a very wide variety of algorithms.

The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function causes maximum likelihood estimation.

The cost function may also include additional terms, such as regularization terms. For example, we can add weight decay to the linear regression cost function to obtain

    J(w, b) = λ‖w‖₂² − E_{x,y∼p̂_data} log p_model(y | x).    (5.101)

This still allows closed-form optimization.

If we change the model to be nonlinear, then most cost functions can no longer be optimized in closed form. This requires us to choose an iterative numerical optimization procedure, such as gradient descent.

The recipe for constructing a learning algorithm by combining models, costs, and optimization algorithms supports both supervised and unsupervised learning. The linear regression example shows how to support supervised learning. Unsupervised learning can be supported by defining a dataset that contains only X and providing an appropriate unsupervised cost and model. For example, we can obtain the first PCA vector by specifying that our loss function is

    J(w) = E_{x∼p̂_data} ‖x − r(x; w)‖₂²    (5.102)

while our model is defined to have w with norm one and reconstruction function r(x) = w^⊤x w.

In some cases, the cost function may be a function that we cannot actually evaluate, for computational reasons. In these cases, we can still approximately minimize it using iterative numerical optimization so long as we have some way of approximating its gradients.

Most machine learning algorithms make use of this recipe, though it may not immediately be obvious. If a machine learning algorithm seems especially unique or
hand-designed, it can usually be understood as using a special-case optimizer. Some models such as decision trees or k-means require special-case optimizers because their cost functions have flat regions that make them inappropriate for minimization by gradient-based optimizers. Recognizing that most machine learning algorithms can be described using this recipe helps to see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.
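As a concrete instance of the recipe, a weight-decay cost in the spirit of Eq. 5.101 still admits a closed-form optimizer. The sketch below uses our own toy data and our own scaling of the objective; it solves the regularized normal equations and checks that weight decay shrinks the solution relative to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

lam = 0.1
# Setting the gradient of (1/2m)||Xw - y||^2 + lam * w^T w to zero gives
# the regularized normal equations (X^T X / m + 2 lam I) w = X^T y / m.
w = np.linalg.solve(X.T @ X / m + 2 * lam * np.eye(n), X.T @ y / m)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)   # lam = 0 recovers least squares

print(np.linalg.norm(w) < np.linalg.norm(w_ols))
```

Swapping the optimizer (normal equations) for gradient descent, or the model for a nonlinear one, changes only one component of the recipe while the others stay fixed.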
5.11 Challenges Motivating Deep Learning

The simple machine learning algorithms described in this chapter work very well on a wide variety of important problems. However, they have not succeeded in solving the central problems in AI, such as recognizing speech or recognizing objects.

The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.

This section is about how the challenge of generalizing to new examples becomes exponentially more difficult when working with high-dimensional data, and how the mechanisms used to achieve generalization in traditional machine learning are insufficient to learn complicated functions in high-dimensional spaces. Such spaces also often impose high computational costs. Deep learning was designed to overcome these and other obstacles.
5.11.1 The Curse of Dimensionality

Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality. Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
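The exponential growth is easy to make concrete: with v distinguishable values per variable, d variables admit v**d joint configurations, which quickly dwarfs any fixed number of training examples. The particular numbers below are our own illustration:

```python
v = 10           # distinguishable values per variable
m = 1_000_000    # a generously sized training set
for d in (2, 4, 8, 16):
    n_configs = v ** d
    # Already at d = 8, the configurations outnumber the examples 100 to 1.
    print(d, n_configs, n_configs > m)
```

Even a million examples cannot begin to cover the 10^16 configurations of just sixteen ten-valued variables.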
[Figure 5.9: as the number of dimensions d grows, the number of distinct cell configurations grows as O(v^d), where v is the number of values distinguished along each axis, e.g. 10 × 10 × 10 = 1000 cells in three dimensions.]
The curse of dimensionality arises in many places in computer science, and especially so in machine learning.

One challenge posed by the curse of dimensionality is a statistical challenge. As illustrated in Fig. 5.9, a statistical challenge arises because the number of possible configurations of x is much larger than the number of training examples. To understand the issue, let us consider that the input space is organized into a grid, like in the figure. In low dimensions we can describe this space with a low number of grid cells that are mostly occupied by the data. When generalizing to a new data point, we can usually tell what to do simply by inspecting the training examples that lie in the same cell as the new input. For example, if estimating the probability density at some point x, we can just return the number of training examples in the same unit volume cell as x, divided by the total number of training examples. If we wish to classify an example, we can return the most common class of training examples in the same cell. If we are doing regression we can average the target values observed over the examples in that cell. But what about the cells for which we have seen no example? Because in high-dimensional spaces the number of configurations is going to be huge, much larger than our number of examples, most configurations will have no training example associated with it.
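We can measure this effect directly by counting how many grid cells actually receive a training example as the dimension grows. The uniform toy data and the 10-cell-per-axis grid below are our own choices:

```python
import numpy as np

def occupied_fraction(d, m=10_000, bins=10, seed=0):
    """Fraction of the bins**d grid cells containing at least one example."""
    X = np.random.default_rng(seed).uniform(size=(m, d))
    cells = {tuple(c) for c in (X * bins).astype(int).tolist()}
    return len(cells) / bins ** d

for d in (1, 2, 3, 6):
    # With fixed m, ever fewer cells contain any data as d grows.
    print(d, occupied_fraction(d))
```

With 10,000 examples, every cell is occupied in one or two dimensions, but in six dimensions at most one percent of the million cells can contain any data at all.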
How could we possibly say something meaningful about these new configurations? Many traditional machine learning algorithms simply assume that the output at a new point should be approximately the same as the output at the nearest training point.
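That assumption is itself a one-line algorithm. A minimal sketch, with hypothetical toy data of our own:

```python
import numpy as np

def predict_nearest(X_train, y_train, x):
    """Return the training output at the training point nearest to x."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[i]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 5.0]])
y_train = np.array([0, 0, 1])
print(predict_nearest(X_train, y_train, np.array([3.5, 4.5])))
```

The query point inherits the label of its nearest training example, which is exactly the local-constancy assumption discussed in the next section.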
5.11.2 Local Constancy and Smoothness Regularization

In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. Previously, we have seen these priors incorporated as explicit beliefs in the form of probability distributions over parameters of the model. More informally, we may also discuss prior beliefs as directly influencing the function itself and only indirectly acting on the parameters via their effect on the function. Additionally, we informally discuss prior beliefs as being expressed implicitly, by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even possible to express) in terms of a probability distribution representing our degree of belief in various functions.

Among the most widely used of these implicit “priors” is the smoothness prior or local constancy prior.
This prior states that the function we learn should not change very much within a small region.

Many simpler algorithms rely exclusively on this prior to generalize well, and as a result they fail to scale to the statistical challenges involved in solving AI-level tasks. Throughout this book, we will describe how deep learning introduces additional (explicit and implicit) priors in order to reduce the generalization error on sophisticated tasks. Here, we explain why the smoothness prior alone is insufficient for these tasks.

There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant. All of these different methods are designed to encourage the learning process to learn a function f∗ that
All of these different satisfies the condition methods are designed to encourage the that f ∗ (x ) ≈learning f∗ (x + pro ) cess to learn a function f(5.103) satisfies the condition for most configurations x and small know w (5.103) a go goo od f (x)change f (x.+In ) other words, if we kno answ answer er for an input x (for example, if≈x is a lab labeled eled training example) then that x for most configurations and small change . In words, if we go kno a good answ answer er is probably go goood in the neigh neighb borho orhoood of x. other If we hav have e several goo o dwanswers x is athem answ er for an b input if bine labeled example) then that in some neigh neighb orho orhoo oxd(for we example, would com combine (b (by ytraining some form of av averaging eraging or answ erolation) is probably goduce od inanthe neighbthat orhoagrees od of xwith . If we e yseveral gooasd m answers in interp terp terpolation) to pro produce answer as hav man many of them uc uch h as inossible. some neighborhood we would combine them (by some form of averaging or p interpolation) to pro duce an answer that agrees with as many of them as much as An extreme example of the lo local cal constancy approac approach h is the k -nearest neighbors possible. 156 An extreme example of the local constancy approach is the k -nearest neighbors
CHAPTER 5. MACHINE LEARNING BASICS
family of learning algorithms. These predictors are literally constant over each region containing all the points x that have the same set of k nearest neighbors in the training set. For k = 1, the number of distinguishable regions cannot be more than the number of training examples.

While the k-nearest neighbors algorithm copies the output from nearby training examples, most kernel machines interpolate between training set outputs associated with nearby training examples. An important class of kernels is the family of local kernels where k(u, v) is large when u = v and decreases as u and v grow farther apart from each other. A local kernel can be thought of as a similarity function that performs template matching, by measuring how closely a test example x resembles each training example x^(i).
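The idea of a local kernel as template matching can be sketched numerically. The Gaussian (RBF) kernel below is one standard example of a local kernel; the bandwidth sigma and the toy training set are illustrative choices, not taken from the text.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=0.3):
    """Local kernel: maximal when u == v, decaying as u and v move apart."""
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def kernel_predict(x, X_train, y_train, sigma=0.3):
    """Interpolate between training outputs, weighting each training
    example by its similarity to the test point (template matching)."""
    w = np.array([gaussian_kernel(x, xi, sigma) for xi in X_train])
    return np.dot(w, y_train) / np.sum(w)

X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 4.0])
# With a narrow kernel, the prediction at x = 1 is dominated by the
# training example whose template it most closely matches.
print(kernel_predict(np.array([1.0]), X_train, y_train))  # close to 1.0
```

Because the weights decay with distance, the prediction far from all training examples degenerates toward a broad average: exactly the locality limitation discussed next.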
Much of the modern motivation for deep learning is derived from studying the limitations of local template matching and how deep models are able to succeed in cases where local template matching fails (Bengio et al., 2006b).

Decision trees also suffer from the limitations of exclusively smoothness-based learning because they break the input space into as many regions as there are leaves and use a separate parameter (or sometimes many parameters for extensions of decision trees) in each region. If the target function requires a tree with at least n leaves to be represented accurately, then at least n training examples are required to fit the tree. A multiple of n is needed to achieve some level of statistical confidence in the predicted output.

In general, to distinguish O(k) regions in input space, all of these methods require O(k) examples.
Typically there are O(k) parameters, with O(1) parameters associated with each of the O(k) regions. The case of a nearest neighbor scenario, where each training example can be used to define at most one region, is illustrated in Fig. 5.10.

Is there a way to represent a complex function that has many more regions to be distinguished than the number of training examples? Clearly, assuming only smoothness of the underlying function will not allow a learner to do that. For example, imagine that the target function is a kind of checkerboard. A checkerboard contains many variations but there is a simple structure to them. Imagine what happens when the number of training examples is substantially smaller than the number of black and white squares on the checkerboard. Based on only local generalization and the smoothness or local constancy prior, we would be guaranteed to correctly guess the color of a new point only if it lies within the same checkerboard square as a training example.
There is no guaran guarantee tee that the learner b e guaranteed to correctly guess the color of a new p oin t if it lies same could correctly extend the chec heck kerb erboard oard pattern to poin oints ts lying in within squaresthe that do checcon kerbtain oardtraining square examples. as a training example. There is no tee that the that learner not contain With this prior alone, theguaran only information an could correctly extend the checkerboard pattern to points lying in squares that do not contain training examples. With this 157prior alone, the only information that an
example tells us is the color of its square, and the only way to get the colors of the entire checkerboard right is to cover each of its cells with at least one example.

The smoothness assumption and the associated non-parametric learning algorithms work extremely well so long as there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned. This is generally true when the function to be learned is smooth enough and varies in few enough dimensions. In high dimensions, even a very smooth function can change smoothly but in a different way along each dimension. If the function additionally behaves differently in different regions, it can become extremely complicated to describe with a set of training examples. If the function is complicated (we want to distinguish a huge
number of regions compared to the number of examples), is there any hope to generalize well?

The answer to both of these questions is yes. The key insight is that a very large number of regions, e.g., O(2^k), can be defined with O(k) examples, so long as we introduce some dependencies between the regions via additional assumptions about the underlying data generating distribution. In this way, we can actually generalize non-locally (Bengio and Monperrus, 2005; Bengio et al., 2006c). Many different deep learning algorithms provide implicit or explicit assumptions that are reasonable for a broad range of AI tasks in order to capture these advantages.

Other approaches to machine learning often make stronger, task-specific assumptions. For example, we could easily solve the checkerboard task by providing the assumption that the target function is periodic.
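The checkerboard argument is easy to verify numerically. In the sketch below (an illustrative setup, not taken from the text), a 1-nearest-neighbor learner sees far fewer training points than there are squares, while a learner handed the true periodic structure recovers the whole pattern by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def checkerboard(x, y):
    """Target function: color alternates between adjacent unit squares."""
    return (np.floor(x) + np.floor(y)) % 2

# Far fewer training points (20) than squares on a 10x10 board (100).
X_train = rng.uniform(0, 10, size=(20, 2))
y_train = checkerboard(X_train[:, 0], X_train[:, 1])

def nn_predict(x):
    """1-nearest neighbor: purely local generalization."""
    d = np.sum((X_train - x) ** 2, axis=1)
    return y_train[np.argmin(d)]

X_test = rng.uniform(0, 10, size=(500, 2))
y_test = checkerboard(X_test[:, 0], X_test[:, 1])

nn_acc = np.mean(np.array([nn_predict(x) for x in X_test]) == y_test)
# The "periodic learner" here is simply given the true structure, so it
# is exact by construction -- the point is what the assumption buys.
periodic_acc = np.mean(checkerboard(X_test[:, 0], X_test[:, 1]) == y_test)
print(nn_acc, periodic_acc)  # nn_acc is typically well below 1.0
```

Most test points fall in squares that contain no training example, so the nearest-neighbor learner is near chance there; the periodicity assumption removes the need for an example in every square.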
Usually we do not include such strong, task-specific assumptions into neural networks so that they can generalize to a much wider variety of structures. AI tasks have structure that is much too complex to be limited to simple, manually specified properties such as periodicity, so we want learning algorithms that embody more general-purpose assumptions. The core idea in deep learning is that we assume that the data was generated by the composition of factors, or features, potentially at multiple levels in a hierarchy. Many other similarly generic assumptions can further improve deep learning algorithms. These apparently mild assumptions allow an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished. These exponential gains are described more precisely in Sec. 6.4.1, Sec. 15.4, and Sec. 15.5.
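The exponential counting claim can be made concrete with a toy illustration (this construction is chosen for exposition; it is not the book's): the joint sign pattern of k linear features carves input space into up to 2^k distinguishable regions while using only O(k) parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10  # k feature directions -> up to 2**k = 1024 distinguishable regions
X = rng.normal(size=(5000, k))

# Each point's region is the joint sign pattern of k linear features
# (here the k coordinates themselves, i.e. k axis-aligned hyperplanes).
# The parameter count grows linearly in k; the region count, exponentially.
codes = (X > 0).astype(int)
regions = {tuple(c) for c in codes}
print(len(regions))  # most of the 1024 possible regions appear
```

A purely local learner would need an example inside each region; a learner that exploits the compositional structure of the sign code does not.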
The exponential advantages conferred by the use of deep, distributed representations counter the exponential challenges posed by the curse of dimensionality.
5.11.3
Manifold Learning
An important concept underlying many ideas in machine learning is that of a manifold.

A manifold is a connected region. Mathematically, it is a set of points, associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space. In everyday life, we experience the surface of the world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.

The definition of a neighborhood surrounding each point implies the existence of transformations that can be applied to move on the manifold from one position to a neighboring one. In the example of the world's surface as a manifold, one can walk north, south, east, or west.

Although there is a formal mathematical meaning to the term “manifold,”
in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of degrees of freedom, or dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation. See Fig. 5.11 for an example of training data lying near a one-dimensional manifold embedded in two-dimensional space. In the context of machine learning, we allow the dimensionality of the manifold to vary from one point to another. This often happens when a manifold intersects itself. For example, a figure eight is a manifold that has a single dimension in most places but two dimensions at the intersection at the center.
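The figure eight can be written down explicitly. The sketch below uses one standard parametrization (chosen for illustration): a single parameter t traces the whole curve, yet the curve crosses itself at the origin, where two local directions of variation meet.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000)
# A single degree of freedom t generates the entire figure eight...
x, y = np.sin(t), np.sin(t) * np.cos(t)

# ...but the curve passes through the origin at t = 0, pi, and 2*pi,
# approaching with different tangent directions, so the manifold has
# two local directions of variation at the crossing point.
near_origin = t[(np.abs(x) < 1e-2) & (np.abs(y) < 1e-2)]
print(near_origin)  # clusters of t values near 0, pi, and 2*pi
```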
Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn functions with interesting variations across all of R^n. Manifold learning algorithms surmount this obstacle by assuming that most of R^n consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the
supervised learning setting: the key assumption remains that probability mass is highly concentrated.

The assumption that the data lies along a low-dimensional manifold may not always be correct or useful. We argue that in the context of AI tasks, such as those that involve processing images, sounds, or text, the manifold assumption is at least approximately correct. The evidence in favor of this assumption consists of two categories of observations.

The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. Uniform noise essentially never resembles structured inputs from these domains. Fig. 5.12 shows how, instead, uniformly sampled points look like the patterns of static that appear on analog television sets when no signal is available.
Similarly, if you generate a document by picking letters uniformly at random, what is the probability that you will get a meaningful English-language text? Almost zero, again, because most of the long sequences of letters do not correspond to a natural language sequence: the distribution of natural language sequences occupies a very small volume in the total space of sequences of letters.
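This thought experiment is easy to quantify with a rough sketch (the tiny stand-in dictionary below is an illustrative assumption): even the probability that a short uniformly random string is an English word is minuscule, and it shrinks exponentially with length.

```python
import random
import string

random.seed(0)
# Tiny stand-in dictionary: a real word list is far larger, yet still a
# vanishing fraction of all possible letter sequences.
words = {"the", "and", "cat", "dog", "deep", "learn"}

n_trials = 100_000
hits = sum(
    "".join(random.choices(string.ascii_lowercase, k=3)) in words
    for _ in range(n_trials)
)
# 4 of the 26**3 = 17576 three-letter strings are in this dictionary,
# so the hit probability is about 2e-4 -- and it shrinks exponentially
# as the string gets longer.
print(hits / n_trials)
```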
Of course, concentrated probability distributions are not sufficient to show that the data lies on a reasonably small number of manifolds. We must also establish that the examples we encounter are connected to each other by other
examples, with each example surrounded by other highly similar examples that may be reached by applying transformations to traverse the manifold. The second argument in favor of the manifold hypothesis is that we can also imagine such neighborhoods and transformations, at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, etc. It remains likely that there are multiple manifolds involved in most applications. For example, the manifold of images of human faces may not be connected to the manifold of images of cat faces.

These thought experiments supporting the manifold hypotheses convey some intuitive reasons supporting it.
More rigorous experiments (Cayton, 2005; Narayanan and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004) clearly support the hypothesis for a large class of datasets of interest in AI.

When the data lies on a low-dimensional manifold, it can be most natural for machine learning algorithms to represent the data in terms of coordinates on the manifold, rather than in terms of coordinates in R^n. In everyday life, we can think of roads as 1-D manifolds embedded in 3-D space. We give directions to specific addresses in terms of address numbers along these 1-D roads, not in terms of coordinates in 3-D space. Extracting these manifold coordinates is challenging, but holds the promise to improve many machine learning algorithms. This general principle is applied in
many contexts. Fig. 5.13 shows the manifold structure of a dataset consisting of faces. By the end of this book, we will have developed the methods necessary to learn such a manifold structure. In Fig. 20.6, we will see how a machine learning algorithm can successfully accomplish this goal.

This concludes Part I, which has provided the basic concepts in mathematics and machine learning which are employed throughout the remaining parts of the book. You are now prepared to embark upon your study of deep learning.
Part II

Deep Networks: Modern Practices
This part of the book summarizes the state of modern deep learning as it is used to solve practical applications.

Deep learning has a long history and many aspirations. Several approaches have been proposed that have yet to entirely bear fruit. Several ambitious goals have yet to be realized. These less-developed branches of deep learning appear in the final part of the book.

This part focuses only on those approaches that are essentially working technologies that are already used heavily in industry.

Modern deep learning provides a very powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity. Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples. Other tasks, that can not be described as associating one vector to another, or that are difficult enough that a person would require time to think and reflect in order to accomplish the task, remain beyond the scope of deep learning for now.

This part of the book describes the core parametric function approximation technology that is behind nearly all modern practical applications of deep learning. We begin by describing the feedforward deep network model that is used to represent these functions. Next, we present advanced techniques for regularization and optimization of such models. Scaling these models to large inputs such as high resolution images or long temporal sequences requires specialization. We introduce the convolutional network for scaling to large images and the recurrent neural network for processing temporal sequences. Finally, we present general guidelines for the practical methodology involved in designing, building, and configuring an application involving deep learning, and review some of the applications of deep learning.

These chapters are the most important for a practitioner—someone who wants to begin implementing and using deep learning algorithms to solve real-world problems today.
Chapter 6
Deep Feedforward Networks

Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, presented in Chapter 10.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.

Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
of the chain gives the depth of the model. It is from this terminology that the name "deep learning" arises. The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x). The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*. Because the training data does not show the desired output for each of these layers, these layers are called hidden layers.

Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representation is drawn from neuroscience. The choice of the functions f^(i)(x) used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. However, modern neural network research is guided by many mathematical and engineering disciplines, and the goal of neural networks is not to perfectly model the brain. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.

One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they may be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a
nonlinear transformation. Equivalently, we can apply the kernel trick described in Sec. 5.7.2, to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.

1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.

2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. This approach requires decades of human effort for each separate task, with practitioners specializing in different domains such as speech recognition or computer vision, and with little transfer between domains.

3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)⊤w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation. If we wish, this approach can capture the benefit of the first approach by being highly generic—we do so by using a very broad family φ(x; θ). This approach can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.

This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all of the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic
mappings from x to y that lack feedback connections. Other models presented later will apply these principles to learning stochastic mappings, learning functions with feedback, and learning probability distributions over a single vector.

We begin this chapter with a simple example of a feedforward network. Next, we address each of the design decisions needed to deploy a feedforward network. First, training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output units. We review these basics of gradient-based learning, then proceed to confront some of the design decisions that are unique to feedforward networks. Feedforward networks have introduced the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values. We must also design the architecture of the network, including how many layers the network should contain, how these networks should be connected to each other, and how many units should be in each layer. Learning in deep neural networks requires computing the gradients of complicated functions. We present the back-propagation algorithm and its modern generalizations, which can be used to efficiently compute these gradients. Finally, we close with some historical perspective.
6.1 Example: Learning XOR
To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

The XOR function ("exclusive or") is an operation on two binary values, x1 and x2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ) and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.

In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, and [1, 1]⊤}. We will train the network on all four of these points. The only challenge is to fit the training set.

We can treat this problem as a regression problem and use a mean squared error loss function. We choose this loss function to simplify the math for this example as much as possible. We will see later that there are other, more appropriate
approaches for modeling binary data. Evaluated on our whole training set, the MSE loss function is

J(θ) = (1/4) Σ_{x∈X} ( f*(x) − f(x; θ) )².   (6.1)

Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

f(x; w, b) = x⊤w + b.   (6.2)

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? Fig. 6.1 shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.

Specifically, we will introduce a very simple feedforward network with one hidden layer containing two hidden units. See Fig. 6.2 for an illustration of this model. This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network.
The output layer is still just a linear regression model, but now it is applied to h rather than to x. The network now contains two functions chained together: h = f^(1)(x; W, c) and y = f^(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).

What function should f^(1) compute? Linear models have served us well so far, and it may be tempting to make f^(1) be linear as well. Unfortunately, if f^(1) were linear, then the feedforward network as a whole would remain a linear function of its input. Ignoring the intercept terms for the moment, suppose f^(1)(x) = W⊤x and f^(2)(h) = h⊤w. Then f(x) = w⊤W⊤x. We could represent this function as f(x) = x⊤w′ where w′ = Ww.

Clearly, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed, nonlinear function called an activation function. We use that strategy here, by defining h = g(W⊤x + c), where W provides the weights of a linear transformation and c the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an
[Figure 6.1 here: two scatter plots, the original x space (axes x1 and x2) and the learned h space (axes h1 and h2).]
Figure 6.1: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]⊤ and x = [0, 1]⊤ to a single point in feature space, h = [1, 0]⊤. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
[Figure 6.2 here: the network drawn as a graph, with input x mapped by W to hidden layer h, then to output y.]
Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is very explicit and unambiguous but for networks larger than this example it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer's activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.
affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x⊤ W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in Fig. 6.3.

We can now specify our complete network as

    f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.    (6.3)

We can now specify a solution to the XOR problem. Let

    W = [ 1  1 ]
        [ 1  1 ],    (6.4)

    c = [  0 ]
        [ -1 ],    (6.5)

    w = [  1 ]
        [ -2 ],    (6.6)
Figure 6.3: The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.
and b = 0.

We can now walk through the way that the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

        [ 0  0 ]
    X = [ 0  1 ]
        [ 1  0 ]
        [ 1  1 ].    (6.7)

The first step in the neural network is to multiply the input matrix by the first layer's weight matrix:

         [ 0  0 ]
    XW = [ 1  1 ]
         [ 1  1 ]
         [ 2  2 ].    (6.8)

Next, we add the bias vector c, to obtain

    [ 0 -1 ]
    [ 1  0 ]
    [ 1  0 ]
    [ 2  1 ].    (6.9)

In this space, all of the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

    [ 0  0 ]
    [ 1  0 ]
    [ 1  0 ]
    [ 2  1 ].    (6.10)

This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in Fig. 6.1, they now lie in a space where a linear model can solve the problem.

We finish by multiplying by the weight vector w:

    [ 0 ]
    [ 1 ]
    [ 1 ]
    [ 0 ].    (6.11)
The neural network has obtained the correct answer for every example in the batch.

In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
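The whole forward pass above can be reproduced in a few lines of NumPy. This is an illustrative sketch, not part of the original text; it simply re-evaluates Eqs. 6.3 to 6.11 on the four XOR inputs:

```python
import numpy as np

# Parameters from Eqs. 6.4-6.6, with b = 0
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

# Design matrix of Eq. 6.7: all four binary inputs, one example per row
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])

h = np.maximum(0.0, X @ W + c)  # Eqs. 6.8-6.10: affine transform, then ReLU
y_hat = h @ w + b               # Eq. 6.11: final linear layer

print(y_hat)  # [0. 1. 1. 0.], the XOR of each row's two inputs
```

Running this reproduces the hand computation exactly: the intermediate array `h` matches Eq. 6.10 and the output matches Eq. 6.11.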
6.2
Gradient-Based Learning
Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In Sec. 5.10, we described how to build a machine learning algorithm by specifying an optimization procedure, a cost function, and a model family.

The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs.
Convex optimization converges starting from any initial parameters (in theory; in practice it is very robust but can encounter numerical problems). Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models will be described in detail in Chapter 8, with parameter initialization in particular discussed in Sec. 8.4. For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another.
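The dependence on initialization is easy to see even in one dimension. The toy sketch below is illustrative and not from the text: it runs gradient descent on the non-convex function f(w) = (w^2 - 1)^2, which has two global minima, at w = -1 and w = +1, and which minimum is reached depends entirely on the starting point:

```python
# Gradient descent on the non-convex toy function f(w) = (w**2 - 1)**2,
# whose gradient is f'(w) = 4 * w * (w**2 - 1).
def grad(w):
    return 4.0 * w * (w * w - 1.0)

def descend(w0, lr=0.1, steps=100):
    """Plain gradient descent from starting point w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

w_from_positive = descend(0.5)   # converges to the minimum at +1
w_from_negative = descend(-0.5)  # converges to the minimum at -1
print(w_from_positive, w_from_negative)
```

Both runs use the same learning rate and step count; only the initial parameter differs, yet they settle into different minima.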
The specific algorithms are improvements and refinements on the ideas of gradient descent, introduced in Sec. 4.3, and,
more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in Sec. 5.9.

We can, of course, train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network, but can still be done efficiently and exactly. Sec. 6.5 will describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model.
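The point that the gradient can be computed "efficiently and exactly" can be made concrete with a finite-difference check. The sketch below is illustrative (the linear model and random data are made up, and back-propagation itself is the subject of Sec. 6.5): the exact analytic gradient of a mean squared error loss agrees closely with a central-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # made-up inputs
y = rng.normal(size=8)        # made-up targets
w = rng.normal(size=3)        # parameters at which we evaluate the gradient

def loss(w):
    """Mean squared error of the linear model f(x; w) = x . w"""
    r = X @ w - y
    return (r ** 2).mean()

# Exact gradient, derived by hand: d/dw mean((Xw - y)^2) = (2/n) X^T (Xw - y)
analytic = 2.0 / len(y) * X.T @ (X @ w - y)

# Central finite differences: approximate, and needing one pair of extra
# loss evaluations per parameter
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    numeric[i] = (loss(w + e) - loss(w - e)) / (2.0 * eps)

print(np.max(np.abs(analytic - numeric)))  # agreement to several decimal places
```

Back-propagation generalizes the hand derivation above to arbitrary compositions of layers, keeping the gradient exact while avoiding the per-parameter cost of finite differences.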
We now revisit these design considerations with special emphasis on the neural networks scenario.

6.2.1 Cost Functions

An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model's predictions as the cost function.

Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions allow us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one
of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in Sec. 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks will be described in Chapter 7.

6.2.1.1 Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described
as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).    (6.12)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in Sec. 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

    J(θ) = (1/2) E_{x,y∼p̂_data} ||y − f(x; θ)||² + const,    (6.13)

up to a scaling factor of 1/2 and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between
maximum likelihood estimation with a Gaussian output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.

An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function log p(y | x).

One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate.
negativ to negativee log-likelihoo log-likelihood d helps this happ ens b ecause the activ ation functions used to pro duce the output of the avoid this problem for many models. Man Many y output units inv involv olv olvee an exp function hidden units or the output units saturate. The negativ e log-likelihoo d helps to that can saturate when its argumen argumentt is very negative. The log function in the a void this problem for many models. Man y output units inv olv e an function exp negativ negativee log-lik log-likeliho eliho elihooo d cost function undo undoes es the exp of some output units. We will logoffunction that can saturate when its argumen t is very negative. in the discuss the interaction b et etwe we ween en the cost function and the The choice output unit in negativ e log-lik eliho o d cost function undo es the exp of some output units. We will Sec. 6.2.2 . discuss the interaction b etween the cost function and the choice of output unit in One un unusual usual prop propert ert erty y of the cross-entrop cross-entropy y cost used to p erform maximum Sec. 6.2.2. lik likeliho eliho elihooo d estimation is that it usually do does es not ha have ve a minimum value when applied One un usual prop ert y of the cross-entrop y usedoutput to p erform maximum to the mo models dels commonly used in practice. For cost discrete variables, most likeliho d estimation is that usually do notthey havecannot a minimum value awhen applied mo models delsoare parametrized initsuch a wa way y es that represent probability to zero the mo usedarbitrarily in practice.close Fortodiscrete output variables, most of or dels one,commonly but can come doing so. Logistic regression mo dels are parametrized in such a wa y that they cannot represent a probability is an example of such a mo model. del. For real-v real-valued alued output variables, if the mo model del of zero or one, but can come arbitrarily close to doing so. 
Logistic regression is an example of such a model. For real-valued output variables, if the model
can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in Chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.

6.2.1.2 Learning Conditional Statistics

Instead of learning a full probability distribution p(y | x; θ) we often want to learn just one conditional statistic of y given x.

For example, we may have a predictor f(x; θ) that we wish to use to predict the mean of y.
From this class pointofoffunctions, view, we with this class b eing limited only by features suc h as contin uit y and b oundedness can view the cost function as being a functional rather than just a function. A rather thanisbay mapping having a from specific parametric form. Fbrom pointth of functional functions to real num numb ers. this We can thus us view, think we of can viewas the cost function as being a than functional rather thanajust a function. A learning choosing a function rather merely choosing set of parameters. functional is a mapping from functions toe real numb ers. occur We can us think of W e can design our cost functional to hav have its minimum at th some sp specific ecific learning as rather thandesign merely choosing a set of parameters. function wechoosing desire. Faorfunction example, we can the cost functional to hav havee its W e can design our cost functional to hav e its minimum occur at some spen ecific x. minim minimum um lie on the function that maps x to the exp expected ected value of y giv given functionanwoptimization e desire. Forproblem example, weresp canect design the costrequires functional to have its Solving with respect to a function a mathematical x to y given to x. minim um lie on theoffunction that mapsed the exp ected alue to tool ol called calculus variations variations, , describ described in Sec. 19.4.2 . Itvis notofnecessary Solving an optimization with ect to a function requires a mathematical understand calculus of problem variations to resp understand the conten content t of this chapter. A Att to ol called c alculus of variations , describ ed in Sec. 19.4.2 . It is not necessary the moment, it is only necessary to understand that calculus of variations ma may y btoe understand calculus of variations to understand the content of this chapter. At used to derive the following two results. 
the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.

Our first result derived using calculus of variations is that solving the optimization problem

    f* = arg min_f E_{x,y∼p_data} ||y − f(x)||²    (6.14)

yields

    f*(x) = E_{y∼p_data(y|x)}[y],    (6.15)

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data-generating distribution, minimizing the mean squared error cost function gives a function that predicts the mean of y for each value of x.
Different cost functions give different statistics. A second result derived using calculus of variations is that

    f* = arg min_f E_{x,y∼p_data} ||y − f(x)||_1    (6.16)

yields a function that predicts the median value of y for each x, so long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).
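Both variational results can be checked empirically with a one-dimensional search. In the sketch below (illustrative; the data are made up and include one outlier), candidate constant predictions are scanned over a grid: squared error is minimized at the sample mean, absolute error at the sample median:

```python
import numpy as np

y = np.array([0.0, 0.0, 0.0, 10.0])      # toy sample with an outlier
c_grid = np.linspace(0.0, 10.0, 1001)    # candidate predictions, step 0.01

# Average squared and absolute error of each candidate constant prediction
mse = ((y[None, :] - c_grid[:, None]) ** 2).mean(axis=1)
mae = np.abs(y[None, :] - c_grid[:, None]).mean(axis=1)

c_mse = c_grid[mse.argmin()]  # minimizer of squared error
c_mae = c_grid[mae.argmin()]  # minimizer of absolute error
print(c_mse, y.mean())        # squared error recovers the mean, 2.5
print(c_mae, np.median(y))    # absolute error recovers the median, 0.0
```

The outlier pulls the squared-error minimizer toward it (the mean) while leaving the absolute-error minimizer at the median, which is one reason mean absolute error is regarded as more robust.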
The choice of how to represent output then determines of the time, w e simply use the cross-entrop y b et ween the data distribution and the the form of the cross-entrop cross-entropy y function. mo del distribution. The choice of how to represent the output then determines Any y kind of cross-entrop neural netw network unit that may b e used as an output can also b e the An form of the york function. used as a hidden unit. Here, we fo focus cus on the use of these units as outputs of the An y kind of neural netw ork unit thatinternally may b e used as an also be mo model, del, but in principle they can b e used as well. Weoutput revisit can these units used as a hidden unit. Here, we fo cus on the use of these units as outputs of the with additional detail ab about out their use as hidden units in Sec. 6.3. mo del, but in principle they can b e used internally as well. We revisit these units Throughout this section, we supp suppose ose that the feedforw feedforward ard net network work pro provides vides a with additional detail ab out their use as hidden units in Sec. 6.3. set of hidden features defined by h = f (x; θ). The role of the output lay layer er is then Throughout this section, we supp ose that the feedforw ard net work pro vides a to provide some additional transformation from the features to complete the task set ofthe hidden features by h = f (x; θ). The role of the output layer is then that netw network ork mustdefined p erform. to provide some additional transformation from the features to complete the task that the network must p erform. One simple kind of output unit is an output unit based on an affine transformation with no nonlinearity nonlinearity.. These are often just called linear units. One simple kind of output unit is an output unit based on an affine transformation Given features h,.aThese la layer yer of linear units linear pro produces duces a vector yˆ = W > h+ b. withGiv noennonlinearity are oftenoutput just called units. 
Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

    p(y | x) = N(y; ŷ, I).        (6.17)
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
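As a numerical sketch of the linear-Gaussian output above (all shapes and values here are made up), the per-example negative log-likelihood of N(y; ŷ, I) is half the squared error plus a constant, so the two objectives share the same minimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network: 5 examples, 3 hidden features, 2 output dimensions.
h = rng.normal(size=(5, 3))      # hidden features from earlier layers
W = rng.normal(size=(3, 2))      # linear output layer weights
b = np.zeros(2)                  # linear output layer biases
y = rng.normal(size=(5, 2))      # regression targets

y_hat = h @ W + b                # linear output units: y_hat = W^T h + b

d = y.shape[1]
# Per-example negative log-likelihood of N(y; y_hat, I):
nll = 0.5 * np.sum((y - y_hat) ** 2, axis=1) + 0.5 * d * np.log(2 * np.pi)
# Per-example half sum-of-squares error for the same predictions:
half_sse = 0.5 * np.sum((y - y_hat) ** 2, axis=1)

# They differ only by a constant, so gradients w.r.t. y_hat are identical.
assert np.allclose(nll - half_sse, 0.5 * d * np.log(2 * np.pi))
```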
Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.

The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Approaches to modeling the covariance are described shortly, in Sec. 6.2.2.4.

Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.

6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.

The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x.

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit, and threshold its value to obtain a valid probability:

    P(y = 1 | x) = max{0, min{1, wᵀh + b}}.        (6.18)
This would indeed define a valid conditional distribution, but we would not be able to train it very effectively with gradient descent. Any time that wᵀh + b strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0 is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters.

Instead, it is better to use a different approach that ensures there is always a strong gradient whenever the model has the wrong answer. This approach is based on using sigmoid output units combined with maximum likelihood.

A sigmoid output unit is defined by

    ŷ = σ(wᵀh + b)        (6.19)
where σ is the logistic sigmoid function described in Sec. 3.10.

We can think of the sigmoid output unit as having two components. First, it uses a linear layer to compute z = wᵀh + b. Next, it uses the sigmoid activation function to convert z into a probability.

We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:
    log P̃(y) = yz        (6.20)
    P̃(y) = exp(yz)        (6.21)
    P(y) = exp(yz) / Σ_{y′=0}^{1} exp(y′z)        (6.22)
    P(y) = σ((2y − 1)z).        (6.23)

Probability distributions based on exponentiation and normalization are common throughout the statistical modeling literature. The z variable defining such a distribution over binary variables is called a logit.

This approach to predicting the probabilities in log-space is natural to use with maximum likelihood learning. Because the cost function used with maximum likelihood is −log P(y | x), the log in the cost function undoes the exp of the sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is

    J(θ) = −log P(y | x)        (6.24)
         = −log σ((2y − 1)z)        (6.25)
         = ζ((1 − 2y)z).        (6.26)
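The identities in Eqs. 6.20–6.26 can be checked numerically; a small sketch (helper functions written out by hand, logits arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # zeta(x) = log(1 + exp(x)), computed stably via log-add-exp
    return np.logaddexp(0.0, x)

# Check the identities for some arbitrary logits z and both values of y.
z = np.array([-3.0, -0.5, 0.0, 2.0])
for y in (0, 1):
    # Eq. 6.22: normalize the unnormalized probabilities exp(y'z) over y' in {0, 1}
    p_normalized = np.exp(y * z) / (np.exp(0 * z) + np.exp(1 * z))
    # Eq. 6.23: the same distribution via a sigmoidal transformation of z
    p_sigmoid = sigmoid((2 * y - 1) * z)
    assert np.allclose(p_normalized, p_sigmoid)
    # Eqs. 6.24-6.26: the negative log-likelihood is the softplus of (1 - 2y)z
    assert np.allclose(-np.log(p_sigmoid), softplus((1 - 2 * y) * z))
```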
This derivation makes use of some properties from Sec. 3.10. By rewriting the loss in terms of the softplus function, we can see that it saturates only when (1 − 2y)z is very negative. Saturation thus occurs only when the model already has the right answer—when y = 1 and z is very positive, or y = 0 and z is very negative. When z has the wrong sign, the argument to the softplus function, (1 − 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|. The derivative with respect to z asymptotes to sign(z), so, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is very useful because it means that gradient-based learning can act to quickly correct a mistaken z.

When we use other loss functions, such as mean squared error, the loss can saturate anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. The gradient can shrink too small to be useful for learning whenever this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.
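The contrast can be sketched numerically. Differentiating each loss with respect to z gives σ(z) − y for the cross-entropy ζ((1 − 2y)z), and (σ(z) − y)·σ(z)·(1 − σ(z)) for the squared error 0.5(σ(z) − y)²; the extra factor vanishes when the sigmoid saturates (the values below are made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Compare gradients w.r.t. the logit z for a positive example (y = 1)
# whose logit is extremely wrong.
y = 1.0
z = -10.0

# Cross-entropy zeta((1 - 2y) z):  d/dz = sigmoid(z) - y
ce_grad = sigmoid(z) - y
# Squared error 0.5 * (sigmoid(z) - y)**2:
#   d/dz = (sigmoid(z) - y) * sigmoid(z) * (1 - sigmoid(z))
mse_grad = (sigmoid(z) - y) * sigmoid(z) * (1.0 - sigmoid(z))

assert abs(ce_grad + 1.0) < 1e-3     # strong signal: gradient is nearly -1
assert abs(mse_grad) < 1e-3          # vanished: the sigmoid has saturated
```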
Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z). If the sigmoid function underflows to zero, then taking the logarithm of ŷ yields negative infinity.

6.2.2.3 Softmax Units for Multinoulli Output Distributions

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n options for some internal variable.

In the case of binary variables, we wished to produce a single number

    ŷ = P(y = 1 | x).        (6.27)

Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well-behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number z = log P̃(y = 1 | x). Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid function.
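A minimal illustration of the numerical point above (function names are my own): the loss computed directly from the logit z via the softplus identity stays finite even when σ(z) underflows to zero:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_from_yhat(y, z):
    # Naive: take the logarithm of the sigmoid output y_hat = sigma(z).
    return -np.log(sigmoid(z) if y == 1 else 1.0 - sigmoid(z))

def nll_from_z(y, z):
    # Stable: the softplus form zeta((1 - 2y) z), written in terms of z itself.
    return np.logaddexp(0.0, (1 - 2 * y) * z)

z = -800.0   # sigmoid(z) underflows to exactly 0.0 in float64
with np.errstate(over='ignore', divide='ignore'):
    assert np.isinf(nll_from_yhat(1, z))     # log(0) gives an infinite loss
assert np.isclose(nll_from_z(1, z), 800.0)   # finite, and equal to |z| here
```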
To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷ_i = P(y = i | x). We require not only that each element of ŷ be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear layer predicts unnormalized log probabilities:

    z = Wᵀh + b,        (6.28)

where z_i = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

    softmax(z)_i = exp(z_i) / Σ_j exp(z_j).        (6.29)

As with the logistic sigmoid, the use of the exp function works very well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize log P(y = i; z) = log softmax(z)_i. Defining the softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

    log softmax(z)_i = z_i − log Σ_j exp(z_j).        (6.30)
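A small numpy sketch of Eqs. 6.29 and 6.30 (toy logits, function names my own):

```python
import numpy as np

def softmax(z):
    # Eq. 6.29: exponentiate and normalize.
    e = np.exp(z)
    return e / e.sum()

def log_softmax(z):
    # Eq. 6.30: z_i minus the log of the normalizer.
    return z - np.log(np.exp(z).sum())

z = np.array([5.0, 1.0, -2.0])    # arbitrary unnormalized log probabilities
p = softmax(z)

assert np.isclose(p.sum(), 1.0)                # a valid probability distribution
assert np.allclose(np.log(p), log_softmax(z))  # the log undoes the exp

# The normalizer log sum_j exp(z_j) is close to max_j z_j when one logit
# dominates, so a correct, confident prediction (class 0 here) adds little loss:
lse = np.log(np.exp(z).sum())
assert abs(lse - z.max()) < 0.05
assert -log_softmax(z)[0] < 0.05
```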
The first term of Eq. 6.30 shows that the input z_i always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of z_i to the second term of Eq. 6.30 becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down. To gain some intuition for the second term, log Σ_j exp(z_j), observe that this term can be roughly approximated by max_j z_j. This approximation is based on the idea that exp(z_k) is insignificant for any z_k that is noticeably less than max_j z_j. The intuition we can gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the −z_i term and the log Σ_j exp(z_j) ≈ max_j z_j = z_i terms will roughly cancel. This example will then contribute little to the overall training cost, which will be dominated by other examples that are not yet correctly classified.

So far we have discussed only a single example. Overall, unregularized maximum likelihood will drive the model to learn parameters that drive the softmax to predict the fraction of counts of each outcome observed in the training set:

    softmax(z(x; θ))_i ≈ (Σ_{j=1}^m 1[y^(j) = i, x^(j) = x]) / (Σ_{j=1}^m 1[x^(j) = x]).        (6.31)

Because maximum likelihood is a consistent estimator, this is guaranteed to happen so long as the model family is capable of representing the training distribution. In practice, limited model capacity and imperfect optimization will mean that the model is only able to approximate these fractions.

Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish. In particular, squared error is a poor loss function for softmax units, and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions (Bridle, 1990). To understand why these other loss functions can fail, we need to examine the softmax function itself.

Like the sigmoid, the softmax activation can saturate. The sigmoid function has a single output that saturates when its input is extremely negative or extremely positive. In the case of the softmax, there are multiple output values. These output values can saturate when the differences between input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activation function.

To see that the softmax function responds to the difference between its inputs, observe that the softmax output is invariant to adding the same scalar to all of its inputs:

    softmax(z) = softmax(z + c).        (6.32)

Using this property, we can derive a numerically stable variant of the softmax:

    softmax(z) = softmax(z − max_i z_i).        (6.33)
The reformulated version allows us to evaluate softmax with only small numerical errors even when z contains extremely large or extremely negative numbers. Examining the numerically stable variant, we see that the softmax function is driven by the amount that its arguments deviate from max_i z_i.

An output softmax(z)_i saturates to 1 when the corresponding input is maximal (z_i = max_i z_i) and z_i is much greater than all of the other inputs. The output softmax(z)_i can also saturate to 0 when z_i is not maximal and the maximum is much greater. This is a generalization of the way that sigmoid units saturate, and can cause similar difficulties for learning if the loss function is not designed to compensate for it.

The argument z to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = Wᵀh + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n − 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 | x) = σ(z) is equivalent to defining P(y = 1 | x) = softmax(z)_1 with a two-dimensional z and z_1 = 0. Both the n − 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions, but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.

From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1 so an increase in the value of one unit necessarily corresponds to a decrease in the value of others. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex.
At the magnitude) it b ecomes a form of winner-take-al winner-take-alll (one of the outputs is nearly 1 extreme (when the difference b et ween the maximal a and the others is large in and the others are nearly 0). magnitude) it b ecomes a form of winner-take-al l (one of the outputs is nearly 1 The name “softmax” can b e somewhat confusing. The function is more closely and the others are nearly 0). related to the argmax function than the max function. The term “soft” derives The can b e somewhat Theand function is more closely from thename fact “softmax” that the softmax function confusing. is con continuous tinuous differentiable. The related to the argmax function than the max function. The term “soft” derives argmax function, with its result represented as a one-hot vector, is not con continuous tinuous from the fact that the softmax function is con tinuous and differentiable. or differentiable. The softmax function thus pro provides vides a “softened” version ofThe the argmax function, with its result represented a one-hotfunction vector, isis not continuous softmax softmax( (z ) > z. argmax. The corresp corresponding onding soft version of theasmaximum or would differentiable. softmax provides a “softened” versionbut of the the It p erhapsThe b e better to function call the thus softmax function “softargmax,” softmax ( z ) z. argmax. The corresp onding softcon version of the maximum function is curren current t name is an en entrenched trenched conven ven vention. tion. It would p erhaps b e better to call the softmax function “softargmax,” but the current name is an entrenched convention. The linear, sigmoid, and softmax output units describ described ed ab abo ove are the most common. Neural net networks works can generalize to almost any kind of output lay layer er that The linear, sigmoid, and softmax output units describ ed ab o ve are the most we wish. The principle of maximum likelihoo likelihood d pro provides vides a guide for how to design common. 
Neural networks can generalize to almost any kind of output layer that we wish. The principle of maximum likelihoo d provides a guide for how to design 186
In general, if we define a conditional distribution p(y | x; θ), the principle of maximum likelihood suggests we use −log p(y | x; θ) as our cost function.

In general, we can think of the neural network as representing a function f(x; θ). The outputs of this function are not direct predictions of the value y. Instead, f(x; θ) = ω provides the parameters for a distribution over y. Our loss function can then be interpreted as −log p(y; ω(x)).

For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance σ² is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value.
A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution p(y | x) that is controlled by ω = f(x; θ). The negative log-likelihood −log p(y; ω(x)) will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance. In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into ω. This new parameter might be σ itself or could be a parameter v representing σ², or it could be a parameter β representing 1/σ², depending on how we choose to parametrize the distribution. We may wish our model to predict a different amount of variance in y for different values of x. This is called a heteroscedastic model.
In the heteroscedastic case, we simply make the specification of the variance be one of the values output by f(x; θ). A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance, as described in Eq. 3.22. In the multivariate case it is most common to use a diagonal precision matrix

diag(β).    (6.34)

This formulation works well with gradient descent because the formula for the log-likelihood of the Gaussian distribution parametrized by β involves only multiplication by β_i and addition of log β_i. The gradient of multiplication, addition, and logarithm operations is well-behaved. By comparison, if we parametrized the output in terms of variance, we would need to use division. The division function becomes arbitrarily steep near zero. While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the
output in terms of standard deviation, the log-likelihood would still involve division, and would also involve squaring. The gradient through the squaring operation can vanish near zero, making it difficult to learn parameters that are squared.
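The contrast between the two parametrizations can be sketched numerically. This is a minimal numpy illustration, not code from the book; the function names are our own:

```python
import numpy as np

def gaussian_nll_precision(y, mu, beta):
    """Negative log-likelihood of a diagonal Gaussian parametrized by
    precision beta = 1/sigma^2. The expression involves only
    multiplication by beta and addition of log(beta): no division."""
    return 0.5 * np.sum(beta * (y - mu) ** 2 - np.log(beta) + np.log(2 * np.pi))

def gaussian_nll_variance(y, mu, var):
    """The same quantity parametrized by variance: the division by `var`
    becomes arbitrarily steep as var approaches 0."""
    return 0.5 * np.sum((y - mu) ** 2 / var + np.log(var) + np.log(2 * np.pi))
```

The two functions compute the same value when var = 1/beta; the difference is only in which operations the gradient must flow through.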
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
Regardless of whether we use standard deviation, variance, or precision, we must ensure that the covariance matrix of the Gaussian is positive definite. Because the eigenvalues of the precision matrix are the reciprocals of the eigenvalues of the covariance matrix, this is equivalent to ensuring that the precision matrix is positive definite. If we use a diagonal matrix, or a scalar times the diagonal matrix, then the only condition we need to enforce on the output of the model is positivity. If we suppose that a is the raw activation of the model used to determine the diagonal precision, we can use the softplus function to obtain a positive precision vector: β = ζ(a). This same strategy applies equally if using variance or standard deviation rather than precision or if using a scalar times identity rather than a diagonal matrix.

It is rare to learn a covariance or precision matrix with richer structure than diagonal.
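The softplus mapping β = ζ(a) can be sketched as follows (an illustrative numpy fragment with variable names of our own choosing; the rewritten form avoids overflow for large |a|):

```python
import numpy as np

def softplus(a):
    """zeta(a) = log(1 + exp(a)): smooth and strictly positive.
    Numerically stable form: max(a, 0) + log1p(exp(-|a|))."""
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

# Hypothetical raw activations from a network's precision head:
a = np.array([-3.0, 0.0, 4.0])
beta = softplus(a)          # valid diagonal precision: every entry > 0
```

Because softplus is positive everywhere, no constraint needs to be enforced explicitly during optimization.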
If the covariance is full and conditional, then a parametrization must be chosen that guarantees positive-definiteness of the predicted covariance matrix. This can be achieved by writing Σ(x) = B(x)B⊤(x), where B is an unconstrained square matrix. One practical issue if the matrix is full rank is that computing the likelihood is expensive, with a d × d matrix requiring O(d³) computation for the determinant and inverse of Σ(x) (or equivalently, and more commonly done, its eigendecomposition or that of B(x)).

We often want to perform multimodal regression, that is, to predict real values that come from a conditional distribution p(y | x) that can have several different peaks in y space for the same value of x. In this case, a Gaussian mixture is a natural representation for the output (Jacobs et al., 1991; Bishop, 1994).
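The parametrization Σ(x) = B(x)B⊤(x) described above can be checked numerically. This is a toy numpy sketch with dimensions of our own choosing; BB⊤ is always symmetric positive semi-definite, and positive definite whenever B has full rank:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))   # unconstrained square matrix
Sigma = B @ B.T                   # symmetric positive semi-definite by construction

eigvals = np.linalg.eigvalsh(Sigma)
```

Every eigenvalue of Sigma is non-negative regardless of the entries of B, which is exactly why no constraint on B is needed.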
Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with n components is defined by the conditional probability distribution

p(y | x) = Σ_{i=1}^{n} p(c = i | x) N(y; μ^(i)(x), Σ^(i)(x)).    (6.35)
The neural network must have three outputs: a vector defining p(c = i | x), a matrix providing μ^(i)(x) for all i, and a tensor providing Σ^(i)(x) for all i. These outputs must satisfy different constraints:

1. Mixture components p(c = i | x): these form a multinoulli distribution over the n different components associated with latent variable c (we consider c to be latent because we do not observe it in the data: given input x and target y, it is not possible to know with certainty which Gaussian component was responsible for y, but we can imagine that y was generated by picking one of them, and make that unobserved choice a random variable), and can
typically be obtained by a softmax over an n-dimensional vector, to guarantee that these outputs are positive and sum to 1.

2. Means μ^(i)(x): these indicate the center or mean associated with the i-th Gaussian component, and are unconstrained (typically with no nonlinearity at all for these output units). If y is a d-vector, then the network must output an n × d matrix containing all n of these d-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation. The expression for the
negative log-likelihood naturally weights each example's contribution to the loss for each component by the probability that the component produced the example.

3. Covariances Σ^(i)(x): these specify the covariance matrix for each component i. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model.
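The three constraints above can be sketched in a minimal mixture-density negative log-likelihood. This is an illustrative numpy fragment showing one possible parametrization, not code from the book: mixture weights come from a (log-)softmax, means are unconstrained, raw scale activations are mapped to positive variances through softplus, and the sum over components is stabilized with log-sum-exp:

```python
import numpy as np

def softplus(a):
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

def mdn_nll(y, logits, mu, raw_scale):
    """Negative log-likelihood of a diagonal-Gaussian mixture output.
    y: (d,) target; logits: (n,) unnormalized mixture weights;
    mu: (n, d) component means (unconstrained);
    raw_scale: (n, d) raw activations mapped to positive variances."""
    log_pi = logits - np.log(np.sum(np.exp(logits)))   # log softmax weights
    var = softplus(raw_scale)                          # positive variances
    # log N(y; mu_i, diag(var_i)) for each component i
    log_comp = -0.5 * np.sum(
        np.log(2 * np.pi * var) + (y - mu) ** 2 / var, axis=1)
    a = log_pi + log_comp                              # joint log-prob per component
    m = np.max(a)                                      # stabilized log-sum-exp
    return -(m + np.log(np.sum(np.exp(a - m))))
```

Note how each component's squared error enters the loss weighted (inside the log-sum-exp) by that component's probability, which is exactly the soft responsibility assignment described in the text.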
It has been reported that gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable (when some variance gets to be small for a particular example, yielding very large gradients). One solution is to clip gradients (see Sec. 10.11.1) while another is to scale the gradients heuristically (Murray and Larochelle, 2014).

Gaussian mixture outputs are particularly effective in generative models of speech (Schuster, 1999) or movements of physical objects (Graves, 2013). The mixture density strategy gives a way for the network to represent multiple output modes and to control the variance of its output, which is crucial for obtaining a high degree of quality in these real-valued domains. An example of a mixture
density network is shown in Fig. 6.4.

In general, we may wish to continue to model larger vectors y containing more variables, and to impose richer and richer structures on these output variables. For example, we may wish for our neural network to output a sequence of characters that forms a sentence. In these cases, we may continue to use the principle of maximum likelihood applied to our model p(y; ω(x)), but the model we use
Figure 6.4: Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All of these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.
to describe y becomes complex enough to be beyond the scope of this chapter. Chapter 10 describes how to use recurrent neural networks to define such models over sequences, and Part III describes advanced techniques for modeling arbitrary probability distributions.
6.3 Hidden Units
So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward networks: how to choose the type of hidden unit to use in the hidden layers of the model.

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.

Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice). We
describe here some of the basic intuitions motivating each type of hidden unit. These intuitions can be used to suggest when to try out each of these units. It is usually impossible to predict in advance which will work best. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.

Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of
the cost function, but instead merely reduce its value significantly, as shown in Fig. 4.3. These ideas will be described further in Chapter 8. Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient. Hidden units that are not differentiable are usually non-differentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other. The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives.
In the case of g(z) = max{0, z}, the left derivative at z = 0 is 0 and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but these usually do not apply to neural network training. The important point is that
in practice one can safely disregard the non-differentiability of the hidden unit activation functions described below.

Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = W⊤x + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
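Both points above, the generic affine-plus-nonlinearity recipe and the convention of returning a one-sided derivative at the rectifier's kink, can be sketched as follows. This is an illustrative numpy fragment, not code from the book:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    """Derivative of the rectifier. At z == 0 the true derivative is
    undefined; like most software implementations, we silently return
    one of the one-sided derivatives (here the left derivative, 0)."""
    return np.where(z > 0, 1.0, 0.0)

def hidden_layer(x, W, b, g=relu):
    """Generic hidden unit: affine transformation z = W^T x + b,
    followed by an element-wise nonlinearity g."""
    return g(W.T @ x + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # 4 inputs -> 3 hidden units
x = rng.standard_normal(4)
b = np.zeros(3)
h = hidden_layer(x, W, b)
```

Only the choice of g would change from one hidden-unit type to another; the affine part is shared by all of them.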
Rectified linear units use the activation function g(z) = max{0, z}.
Rectified linear units are easy to optimize because they are so similar to linear units. The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent. The second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. This means that the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation:

h = g(W⊤x + b).    (6.36)
When initializing the parameters of the affine transformation, it can be a good practice to set all elements of b to a small, positive value, such as 0.1. This makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.

Several generalizations of rectified linear units exist. Most of these generalizations perform comparably to rectified linear units and occasionally perform better.

One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero. A variety of generalizations of rectified linear units guarantee that they receive gradient everywhere.

Three generalizations of rectified linear units are based on using a non-zero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i). Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. It is used for object recognition
from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. Other generalizations of rectified linear units are more broadly applicable. A leaky ReLU (Maas et al., 2013) fixes α_i to a small value like 0.01 while a parametric ReLU or PReLU treats α_i as a learnable parameter (He et al., 2015).

Maxout units (Goodfellow et al., 2013a) generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
of these groups:

g(z)_i = max_{j ∈ G^(i)} z_j,  (6.37)

where G^(i) is the indices of the inputs for group i, {(i − 1)k + 1, . . . , ik}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.

A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, absolute value rectification function, or the leaky or parametric ReLU, or can learn to implement a totally different function altogether. The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.

Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low (Cai et al., 2013).

Maxout units have a few other benefits. In some cases, one can gain some statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by n different linear filters can be summarized without losing information by taking the max over each group of k features, then the next layer can get by with k times fewer weights.

Because each maxout unit is driven by multiple filters, maxout units have some redundancy that helps them to resist a phenomenon called catastrophic forgetting in which neural networks forget how to perform tasks that they were trained on in the past (Goodfellow et al., 2014a).

Rectified linear units and all of these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear. This same general principle of using linear behavior to obtain easier optimization also applies in other contexts besides deep linear networks. Recurrent networks can learn from sequences and produce a sequence of states and outputs. When training them, one needs to propagate information through several time steps, which is much easier when some linear computations (with some directional derivatives being of magnitude near 1) are involved.
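To make the earlier formulas concrete, here is a small NumPy sketch (our own illustration; the function names and test values are not from the text) of the slope-based rectifier generalizations and of a maxout unit as in Eq. 6.37:

```python
import numpy as np

# h_i = max(0, z_i) + alpha_i * min(0, z_i): alpha = 0 gives the rectifier,
# alpha = 0.01 a leaky ReLU, alpha = -1 absolute value rectification, and a
# learned alpha is a parametric ReLU (PReLU).
def generalized_relu(z, alpha):
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

# A maxout unit (Eq. 6.37): split the pre-activations z into groups of k
# values and output the maximum element of each group.
def maxout(z, k):
    return np.asarray(z, dtype=float).reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 0.0, 3.0])
print(generalized_relu(z, -1.0))   # absolute value rectification: |z|
print(maxout(z, k=2))              # groups [-2, -0.5] and [0, 3] -> [-0.5, 3.0]
```

With α fixed per unit these are ordinary element-wise activations; the maxout version instead consumes k pre-activations per output unit, which is why it needs k weight vectors per unit.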
One of the best-performing recurrent network
architectures, the LSTM, propagates information through time via summation—a particular straightforward kind of such linear activation. This is discussed further in Sec. 10.10.

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

g(z) = σ(z)  (6.38)

or the hyperbolic tangent activation function

g(z) = tanh(z).  (6.39)

These activation functions are closely related because tanh(z) = 2σ(2z) − 1.
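The identity relating the two activations can be checked numerically; this short sketch (our own, not from the text) confirms tanh(z) = 2σ(2z) − 1 on a grid of inputs:

```python
import numpy as np

def sigmoid(z):
    # the logistic sigmoid sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
# largest deviation between tanh(z) and 2*sigma(2z) - 1 over the grid
max_gap = np.max(np.abs(np.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0)))
print(max_gap)  # numerically zero (floating-point rounding only)
```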
We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal units saturate across most of their domain—they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0. The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged. Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.

When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2. Because tanh is similar to identity near 0, training a deep neural network ŷ = w⊤ tanh(U⊤ tanh(V⊤x)) resembles training a linear model ŷ = w⊤U⊤V⊤x so long as the activations of the network can be kept small. This makes training the tanh network easier.

Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.
Many other types of hidden units are possible, but are used less frequently. In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, the authors tested a feedforward network using h = cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1%, which is competitive with results obtained using more conventional activation functions. During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement. New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.
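As an illustrative sketch only (this is our own toy construction, not the authors' MNIST experiment, and the layer sizes are arbitrary), an unconventional cosine hidden layer is as easy to write as a conventional one:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # 3 inputs, 4 hidden units
b = np.zeros(4)
x = rng.standard_normal(3)

# h = cos(Wx + b), applied element-wise, following the h = g(W'x + b) pattern
h = np.cos(W.T @ x + b)
print(h.shape)  # (4,)
```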
It would be impractical to list all of the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.

One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W⊤x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix W of the original layer. The factored approach is to compute h = g(V⊤U⊤x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters. It comes at the cost of constraining the linear transformation to be low-rank, but these low-rank relationships are often sufficient. Linear hidden units thus offer an effective way of reducing the number of parameters in a network.

Softmax units are another kind of unit that is usually used as an output (as described in Sec. 6.2.2.3) but may sometimes be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch.
These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory, described in Sec. 10.12.
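The parameter saving from the low-rank factoring described above is easy to verify numerically. In this sketch (the sizes n, p, q are our own choices, not from the text), the factored pair U, V replaces W with a tenth of the weights while composing to a single rank-q linear map:

```python
import numpy as np

n, p, q = 1000, 1000, 50
full_params = n * p              # a single n x p weight matrix W
factored_params = (n + p) * q    # U is n x q, V is q x p

rng = np.random.default_rng(0)
U = rng.standard_normal((n, q))
V = rng.standard_normal((q, p))
x = rng.standard_normal(n)

# two linear layers with no activation in between compose to one linear map
h = (x @ U) @ V                  # identical (up to rounding) to x @ (U @ V)
print(full_params, factored_params)  # 1000000 100000
```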
A few other reasonably common hidden unit types include:

• Radial basis function or RBF unit: h_i = exp(−‖W_{:,i} − x‖² / σ_i²). This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.

• Softplus: g(a) = ζ(a) = log(1 + e^a). This is a smooth version of the rectifier, introduced by Dugas et al. (2001) for function approximation and by Nair and Hinton (2010) for the conditional distributions of undirected probabilistic models. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter. The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive—one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

• Hard tanh: this is shaped similarly to the tanh and the rectifier but unlike the latter, it is bounded, g(a) = max(−1, min(1, a)). It was introduced by Collobert (2004).

Hidden unit design remains an active area of research and many useful hidden unit types remain to be discovered.
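Minimal sketches (our own, with illustrative parameter choices) of the three units listed above:

```python
import numpy as np

def softplus(a):
    # zeta(a) = log(1 + exp(a)), a smooth version of the rectifier
    return np.log1p(np.exp(a))

def hard_tanh(a):
    # bounded, piecewise linear: g(a) = max(-1, min(1, a))
    return np.maximum(-1.0, np.minimum(1.0, a))

def rbf_unit(x, template, sigma=1.0):
    # most active when x is near the template; saturates to 0 elsewhere
    d = np.asarray(x, dtype=float) - np.asarray(template, dtype=float)
    return np.exp(-np.dot(d, d) / sigma ** 2)

print(softplus(0.0))                           # log 2, about 0.693
print(hard_tanh(np.array([-3.0, 0.3, 5.0])))   # [-1.   0.3  1. ]
print(rbf_unit([1.0, 2.0], [1.0, 2.0]))        # 1.0 exactly at the template
```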
6.4
Architecture Design
Another key design consideration for neural networks is determining the architecture. The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by

h^(1) = g^(1)(W^(1)⊤ x + b^(1)),  (6.40)

the second layer is given by

h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2)),  (6.41)

and so on.
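Eqs. 6.40 and 6.41 amount to only a few lines of code. This sketch (the layer sizes and the choice of rectifier activations are our own) computes one forward pass through a two-layer chain:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(4)   # first layer: 3 inputs -> 4 units
W2, b2 = rng.standard_normal((4, 2)), np.zeros(2)   # second layer: 4 units -> 2 units

x = rng.standard_normal(3)
h1 = relu(W1.T @ x + b1)    # h(1) = g(1)(W(1)' x + b(1)), Eq. 6.40
h2 = relu(W2.T @ h1 + b2)   # h(2) = g(2)(W(2)' h(1) + b(2)), Eq. 6.41
print(h1.shape, h2.shape)   # (4,) (2,)
```

Each layer consumes only the previous layer's output, which is exactly the chain structure the text describes.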
In these chain-based architectures, the main architectural considerations are to choose the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks often are able to use far fewer units per layer and far fewer parameters and often generalize to the test set, but are also often harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.

A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train because many loss functions result in convex optimization problems when applied to linear models. Unfortunately, we often want to learn nonlinear functions.

At first glance, we might presume that learning a nonlinear function requires
designing a specialized model family for the kind of nonlinearity we want to learn. Fortunately, feedforward networks with hidden layers provide a universal approximation framework. Specifically, the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well (Hornik et al., 1990).
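The theorem is purely existential, but its flavor can be illustrated by hand (a sketch of our own, not the actual proof): a single hidden layer of steep logistic "squashing" units plus a linear output layer builds a staircase that tracks a target function, here f(x) = x² on [−1, 1]:

```python
import numpy as np

def sigmoid(z):
    z = np.clip(z, -50.0, 50.0)       # avoid overflow in exp for steep units
    return 1.0 / (1.0 + np.exp(-z))

f = lambda x: x ** 2
centers = np.linspace(-1.0, 1.0, 201)   # one steep hidden unit per step
weights = np.diff(f(centers))           # output-layer weights = step increments

def one_hidden_layer_net(x, steepness=1000.0):
    # each steep sigmoid is nearly a step function turning on at its center,
    # so the linear output layer sums increments of f up to x: a staircase
    hidden = sigmoid(steepness * (x[:, None] - centers[1:][None, :]))
    return f(centers[0]) + hidden @ weights

xs = np.linspace(-0.95, 0.95, 97)
err = np.max(np.abs(one_hidden_layer_net(xs) - f(xs)))
print(err)  # small; shrinks further as more, steeper units are added
```

Adding hidden units (finer steps) drives the error toward zero, which is the qualitative content of the theorem for this one-dimensional case.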
The concept of Borel measurability is beyond the scope of this book; for our purposes it suffices to say that any continuous function on a closed and bounded subset of R^n is Borel measurable and therefore may be approximated by a neural network. A neural network may also approximate any function mapping from any finite dimensional discrete space to another. While the original theorems were first stated in terms of units with activation functions that saturate both for very negative and for very positive arguments, universal approximation theorems have also been proven for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).

The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this
How However, ever,approxim we are not guaran guaranteed teed means that thethat training algorithm willfunction b e able w learn, we knoifwthe that a large MLP will b e able this toe are trying that to function. Even MLP is able to represent theto function, learning function. How ever, we are not guaran teed that the training algorithm will b e able can fail for tw twoo differen differentt reasons. First, the optimization algorithm used for training to that function. Even if the MLP is able to represent the function, learning 197 optimization algorithm used for training can fail for two different reasons. First, the
may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function due to overfitting. Recall from Sec. 5.2.1 that the "no free lunch" theorem shows that there is no universally superior machine learning algorithm. Feedforward networks provide a universal system for representing functions, in the sense that, given a function, there exists a feedforward network that approximates the function. There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.

The universal approximation theorem says that there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be.
Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}^n is 2^(2^n) and selecting one such function requires 2^n bits, which will in general require O(2^n) degrees of freedom.

In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.

There exist families of functions which can be approximated efficiently by an architecture with depth greater than some value d, but which require a much larger model if depth is restricted to be less than or equal to d. In many cases, the number of hidden units required by the shallow model is exponential in n. Such results were first proven for models that do not resemble the continuous, differentiable neural networks used for machine learning, but have since been extended to these models. The first results were for circuits of logic gates (Håstad, 1986). Later work extended these results to linear threshold units with non-negative weights (Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with continuous-valued activations (Maass, 1992; Maass et al., 1994). Many modern neural networks use rectified linear units. Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency—they specify only that a sufficiently wide rectifier network could represent any function. Pascanu et al.
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
(2013b) and Montufar et al. (2014) showed that functions representable with a deep rectifier net can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Fig. 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions which can capture all kinds of regular (e.g., repeating) patterns.
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks formally shown by Pascanu et al. (2014a) and by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers).

More precisely, the main theorem in Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer, is

\[ O\!\left( \binom{n}{d}^{d(l-1)} n^{d} \right), \tag{6.42} \]

i.e., exponential in the depth l. In the case of maxout networks with k filters per unit, the number of linear regions is

\[ O\!\left( k^{(l-1)+d} \right). \tag{6.43} \]
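As a concrete illustration, the dominant terms of these two bounds can be evaluated numerically. The sketch below (the function names are my own, not from the text) computes them for small settings of d, l, n, and k, showing the exponential growth with depth:

```python
from math import comb

def rectifier_region_bound(n, d, l):
    """Dominant term of Eq. 6.42: linear regions of a deep rectifier
    network with d inputs, depth l, and n units per hidden layer."""
    return comb(n, d) ** (d * (l - 1)) * n ** d

def maxout_region_bound(k, d, l):
    """Dominant term of Eq. 6.43: maxout network with k filters per unit."""
    return k ** ((l - 1) + d)

# Both bounds grow exponentially with depth l.
print(rectifier_region_bound(n=4, d=2, l=2))  # comb(4,2)**2 * 4**2 = 576
print(maxout_region_bound(k=2, d=3, l=4))     # 2**((4-1)+3) = 64
```

Doubling l multiplies the exponent, not the count, which is the source of the exponential advantage of depth over width in this analysis.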
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property.

We may also want to choose a deep model for statistical reasons. Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation. Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output. These intermediate outputs are not necessarily factors of variation, but can instead be analogous to counters or pointers that the network uses to organize its internal processing. Empirically, greater depth does seem to result in better generalization for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009; Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a). See Fig. 6.6 and Fig. 6.7 for examples of some of these empirical results. This suggests that using deep architectures does indeed express a useful prior over the space of functions the model learns.

So far we have described neural networks as being simple chains of layers, with the main considerations being the depth of the network and the width of each layer. In practice, neural networks show considerably more diversity.

Many neural network architectures have been developed for specific tasks. Specialized architectures for computer vision called convolutional networks are described in Chapter 9. Feedforward networks may also be generalized to the recurrent neural networks for sequence processing, described in Chapter 10, which have their own architectural considerations.

In general, the layers need not be connected in a chain, even though this is the most common practice. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer i to layer i + 2 or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input.
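A minimal sketch of this idea, with made-up layer sizes and a plain NumPy forward pass (not the book's notation): a skip connection simply adds an earlier layer's output back in downstream, giving the later layer a shortcut path around the intermediate one.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(0.0, a)

d = 8  # illustrative layer width
W1, W2, W3 = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

def forward_chain(x):
    # Plain chain: each layer sees only the previous layer's output.
    h1 = relu(W1 @ x)
    h2 = relu(W2 @ h1)
    return W3 @ h2

def forward_with_skip(x):
    # Skip connection from layer 1 to layer 3: h1 bypasses layer 2,
    # shortening the path gradients travel back to early layers.
    h1 = relu(W1 @ x)
    h2 = relu(W2 @ h1)
    return W3 @ (h2 + h1)
```

During back-propagation, the addition node routes part of the gradient directly from the output back to h1, which is why such connections ease gradient flow.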
Figure 6.6: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See Fig. 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.
Figure 6.7: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).
Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for reducing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem-dependent. For example, convolutional networks, described in Chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems. In this chapter, it is difficult to give much more specific advice concerning the architecture of a generic neural network. Subsequent chapters develop the particular architectural strategies that have been found to work well for different application domains.
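To make the parameter-savings argument concrete, here is a back-of-the-envelope comparison (the sizes are invented for illustration) between a fully connected layer and a sparsely connected one in which each output unit sees only a small window of inputs:

```python
# Illustrative sizes, not from the text.
n_in, n_out, window = 1000, 1000, 9

dense_params = n_in * n_out      # every input connects to every output
sparse_params = n_out * window   # each output sees only `window` inputs

print(dense_params)   # 1000000
print(sparse_params)  # 9000
```

The sparse scheme here uses roughly 100x fewer parameters, which is the kind of saving the convolutional connectivity patterns of Chapter 9 exploit.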
6.5 Back-Propagation and Other Differentiation Algorithms

When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The inputs x provide the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to then flow backwards through the network, in order to compute the gradient.

Computing an analytical expression for the gradient is straightforward, but numerically evaluating such an expression can be computationally expensive. The back-propagation algorithm does so using a simple and inexpensive procedure.

The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient. Furthermore, back-propagation is often misunderstood as being specific to multi-layer neural networks, but in principle it can compute derivatives of any function (for some functions, the correct response is to report that the derivative of the function is undefined). Specifically, we will describe how to compute the gradient ∇_x f(x, y) for an arbitrary function f, where x is a set of variables whose derivatives are desired, and y is an additional set of variables that are inputs to the function
but whose derivatives are not required. In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇_θ J(θ). Many machine learning tasks involve computing other derivatives, either as part of the learning process, or to analyze the learned model. The back-propagation algorithm can be applied to these tasks as well, and is not restricted to computing the gradient of the cost function with respect to the parameters. The idea of computing derivatives by propagating information through a network is very general, and can be used to compute values such as the Jacobian of a function f with multiple outputs. We restrict our description here to the most commonly used case where f has a single output.

So far we have discussed neural networks with a relatively informal graph language. To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.

Many ways of formalizing computation as graphs are possible. Here, we use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.

To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.

Without loss of generality, we define an operation to return only a single output variable. This does not lose generality because the output variable can have multiple entries, such as a vector. Software implementations of back-propagation usually support operations with multiple outputs, but we avoid this case in our description because it introduces many extra details that are not important to conceptual understanding.

If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context.

Examples of computational graphs are shown in Fig. 6.8.
Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x⊤w + b). Some of the intermediate expressions do not have names in the algebraic expression but need names in the graph. We simply name the i-th such variable u^(i). (c) The computational graph for the expression H = max{0, XW + b}, which computes a design matrix of rectified linear unit activations H given a design matrix containing a minibatch of inputs X. (d) Examples a–c applied at most one operation to each variable, but it is possible to apply more than one operation. Here we show a computation graph that applies more than one operation to the weights w of a linear regression model. The weights are used to make both the prediction ŷ and the weight decay penalty λ∑_i w_i^2.
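The graphs in panels (b) and (d) can be traced in code. The following sketch (toy values, variable names of my choosing) evaluates each node in order, which is exactly what forward propagation does:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Panel (b): y_hat = sigma(x^T w + b), two operations chained.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
b = 0.1

u1 = x @ w + b        # intermediate node (affine transformation)
y_hat = sigmoid(u1)   # output node (logistic sigmoid)

# Panel (d): the variable w feeds two operations, the prediction
# above and the weight decay penalty lambda * sum_i w_i**2.
lam = 0.01
penalty = lam * np.sum(w ** 2)
```

Note that w appears as the input to two different operation nodes, so during back-propagation its gradient accumulates contributions from both paths.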
The chain rule of calculus (not to b e confused with the chain rule of probability) is used to compute the deriv derivativ ativ atives es of functions formed by comp composing osing other functions The c hain rule of calculus (not to b e confused with the chain rule of probability) is whose deriv derivativ ativ atives es are kno known. wn. Bac Back-propagation k-propagation is an algorithm that computes the compute derivativ es of compefficien osing other functions cused haintorule, with athe sp specific ecific order of functions op operations erationsformed that isbyhighly efficient. t. whose derivatives are known. Back-propagation is an algorithm that computes the Let x b e a real num numb b er, and let f and g b oth b e functions mapping from a real chain rule, with a sp ecific order of op erations that is highly efficient. number to a real num numb b er. Supp Suppose ose that y = g (x) and z = f (g (x)) = f (y ). Then x f and g b oth b e functions mapping from a real Let b e a real num b er, and let the chain rule states that y =dyg (x) and z = f (g (x)) = f (y ). Then number to a real numb er. Supp osedzthatdz = . (6.44) the chain rule states that dx dy dx dz dz dy = . (6.44) We can generalize this b ey eyond ond dx the scalar Suppose ose that x ∈ R m, y ∈ Rn , dy dxcase. Supp g maps from Rm to Rn , and f maps from R n to R. If y = g (x ) and z = R f (y ), then R We can generalize this b eyond the scalar case. Supp ose that x ,y , R R R R X ∂ z from ∂ zto∂ y j. If y = g (x ) and ∈ g maps from z = f (y ),∈then to , and f maps . (6.45) = ∂xi ∂ y j ∂ xi j ∂z ∂y ∂z . (6.45) = ∂x ∂y ∂x In vector notation, this may b e equiv equivalently alently written as > In vector notation, this may b e equivalently ∂ y written as (6.46) ∇xz = X ∇y z , ∂x ∂y z= z, (6.46) ∂y x g. matrix∂of where ∂x is the n × m Jacobian ∇ ∇ x can b e obtained by multiplying F rom this we see that the gradient of a v ariable is the n ∂ym Jacobian matrix of g . 
where a Jacobian matrix ∂x by a gradient ∇yz. The algorithm consists back-propagation × that the gradient of a variable x F rom this we see can b e obtained ultiplying of p erforming suc such h a Jacobian-gradient pro product duct for each op operation eration by in m the graph. z. The back-propagation algorithm consists a Jacobian matrix by a gradient wesuc doh not apply the bac back-propagation k-propagation algorithm merely to vectors, of pUsually erforming a Jacobian-gradient ∇ pro duct for each op eration in the graph. but rather to tensors of arbitrary dimensionalit dimensionality y. Conceptually Conceptually,, this is exactly the Usually w e do not apply the bac k-propagation algorithmis merely vectors, same as back-propagation with vectors. The only difference ho how w thetonum numb b ers but rather to tensors of arbitrary dimensionalit y . Conceptually , this is exactly the are arranged in a grid to form a tensor. We could imagine flattening each tensor same as back-propagation with vectors. The only difference is ho w the num b ers in into to a vector b efore we run back-propagation, computing a vector-v vector-valued alued gradient, are arranged in a gridthe to gradien form a ttensor. We could imagine each tensor and then reshaping gradient back into a tensor. In flattening this rearranged view, in to a v ector b efore we run back-propagation, computing a vector-v alued gradient, bac back-propagation k-propagation is still just multiplying Jacobians by gradien gradients. ts. and then reshaping the gradient back into a tensor. In this rearranged view, To denote the gradient of a value z with resp respect ect to a tensor , we write ∇ z , back-propagation is still just multiplying Jacobians by gradients. just as if were a vector. The indices into no now w ha have ve multiple co coordinates—for ordinates—for z with zy, To denote gradient of a value resp ect to W a tensor , we write example, a 3-Dthe tensor is indexed by three co coordinates. ordinates. 
We can abstract this away by using a single variable i to represent the complete tuple of indices. For all possible index tuples i, (∇_X z)_i gives ∂z/∂X_i. This is exactly the same as how for all
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
possible integer indices i into a vector, (∇_x z)_i gives ∂z/∂x_i. Using this notation, we can write the chain rule as it applies to tensors. If Y = g(X) and z = f(Y), then

∇_X z = Σ_j (∇_X Y_j) ∂z/∂Y_j.    (6.47)
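Eq. 6.47 can be exercised directly with numpy. In this sketch the choices of g (elementwise squaring) and f (summation) are illustrative assumptions; the loop runs over index tuples j exactly as in the equation:

```python
import numpy as np

# Illustrative choice: Y = g(X) squares X elementwise, z = f(Y) sums the entries.
X = np.array([[1.0, -2.0], [0.5, 3.0]])
Y = X ** 2
z = Y.sum()

# dz/dY_j = 1 for every index tuple j, since z just sums Y.
dz_dY = np.ones_like(Y)

# grad_X Y_j: for elementwise squaring, Y_j depends only on X_j,
# so grad_X Y_j is a one-hot tensor holding 2*X_j at position j.
grad = np.zeros_like(X)
for j in np.ndindex(Y.shape):          # iterate over index tuples j
    grad_X_Yj = np.zeros_like(X)
    grad_X_Yj[j] = 2.0 * X[j]
    grad += grad_X_Yj * dz_dY[j]       # Eq. 6.47: sum over j

assert np.allclose(grad, 2.0 * X)      # matches the closed form d(sum X^2)/dX
```

In practice no library materializes the one-hot tensors; each operation's bprop computes the summed product directly.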
Using the chain rule, it is straightforward to write down an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar. However, actually evaluating that expression in a computer introduces some extra considerations.

Specifically, many subexpressions may be repeated several times within the overall expression for the gradient. Any procedure that computes the gradient will need to choose whether to store these subexpressions or to recompute them several times. An example of how these repeated subexpressions arise is given in Fig. 6.9. In some cases, computing the same subexpression twice would simply be wasteful. For complicated graphs, there can be exponentially many of these wasted computations, making a naive implementation of the chain rule infeasible.
In other cases, computing the same subexpression twice could be a valid way to reduce memory consumption at the cost of higher runtime.

We begin with a version of the back-propagation algorithm that specifies the actual gradient computation directly (Algorithm 6.2, along with Algorithm 6.1 for the associated forward computation), in the order it will actually be done, according to the recursive application of the chain rule. One could either directly perform these computations or view the description of the algorithm as a symbolic specification of the computational graph for computing the back-propagation. However, this formulation does not make explicit the manipulation and construction of the symbolic graph that performs the gradient computation. Such a formulation is presented below in Sec. 6.5.6, with Algorithm 6.5, where we also generalize to nodes that contain arbitrary tensors.
First consider a computational graph describing how to compute a single scalar u^(n) (say the loss on a training example). This scalar is the quantity whose gradient we want to obtain, with respect to the n_i input nodes u^(1) to u^(n_i). In other words, we wish to compute ∂u^(n)/∂u^(i) for all i ∈ {1, 2, ..., n_i}. In the application of back-propagation to computing gradients for gradient descent over parameters, u^(n) will be the cost associated with an example or a minibatch, while u^(1) to u^(n_i) correspond to the parameters of the model.
We will assume that the nodes of the graph have been ordered in such a way that we can compute their output one after the other, starting at u^(n_i + 1) and going up to u^(n). As defined in Algorithm 6.1, each node u^(i) is associated with an operation f^(i) and is computed by evaluating the function

u^(i) = f^(i)(A^(i))    (6.48)

where A^(i) is the set of all nodes that are parents of u^(i).

Algorithm 6.1  A procedure that performs the computations mapping n_i inputs u^(1) to u^(n_i) to an output u^(n). This defines a computational graph where each node computes numerical value u^(i) by applying a function f^(i) to the set of arguments A^(i) that comprises the values of previous nodes u^(j), j < i, with j ∈ Pa(u^(i)). The input to the computational graph is the vector x, and is set into the first n_i nodes u^(1) to u^(n_i). The output of the computational graph is read off the last (output) node u^(n).

for i = 1, ..., n_i do
  u^(i) ← x_i
end for
for i = n_i + 1, ..., n do
  A^(i) ← {u^(j) | j ∈ Pa(u^(i))}
  u^(i) ← f^(i)(A^(i))
end for
return u^(n)
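The forward procedure of Algorithm 6.1 might be sketched as follows; the representation of the graph as plain dicts of Python functions and parent lists is an illustrative assumption:

```python
# A minimal sketch of Algorithm 6.1. The dict-based graph encoding is an
# illustrative assumption, not the book's API.
def forward_prop(x, ops, parents, n_i):
    """ops[i] is f^(i); parents[i] lists the indices j in Pa(u^(i))."""
    n = n_i + len(ops)
    u = {}
    for i in range(1, n_i + 1):         # u^(i) <- x_i for the input nodes
        u[i] = x[i - 1]
    for i in range(n_i + 1, n + 1):     # remaining nodes, in topological order
        A = [u[j] for j in parents[i]]  # A^(i) = {u^(j) | j in Pa(u^(i))}
        u[i] = ops[i](A)
    return u[n]                         # output read off the last node

# Example graph: u3 = u1 * u2, u4 = u3 + u1.
ops = {3: lambda A: A[0] * A[1], 4: lambda A: A[0] + A[1]}
parents = {3: [1, 2], 4: [3, 1]}
print(forward_prop([2.0, 5.0], ops, parents, n_i=2))  # 2*5 + 2 = 12.0
```

The only requirement on the node numbering is topological order: every parent index is smaller than its child's index, so each A^(i) is fully available when f^(i) runs.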
That algorithm specifies the forward propagation computation, which we could put in a graph G. In order to perform back-propagation, we can construct a computational graph that depends on G and adds to it an extra set of nodes. These form a subgraph B with one node per node of G. Computation in B proceeds in exactly the reverse of the order of computation in G, and each node of B computes the derivative ∂u^(n)/∂u^(i) associated with the forward graph node u^(i). This is done using the chain rule with respect to the scalar output u^(n):

∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j))    (6.49)

as specified by Algorithm 6.2. The subgraph B contains exactly one edge for each edge from node u^(j) to node u^(i) of G. The edge from u^(j) to u^(i) is associated with the computation of ∂u^(i)/∂u^(j). In addition, a dot product is performed for each node, between the gradient already computed with respect to nodes u^(i) that are children
of u^(j) and the vector containing the partial derivatives ∂u^(i)/∂u^(j) for the same children nodes u^(i). To summarize, the amount of computation required for performing the back-propagation scales linearly with the number of edges in G, where the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents) as well as performing one multiplication and one addition. Below, we generalize this analysis to tensor-valued nodes, which is just a way to group multiple scalar values in the same node and enable more efficient implementations.

Algorithm 6.2  Simplified version of the back-propagation algorithm for computing the derivatives of u^(n) with respect to the variables in the graph. This example is intended to further understanding by showing a simplified case where all variables are scalars, and we wish to compute the derivatives with respect to u^(1), ..., u^(n_i). This simplified version computes the derivatives of all nodes in the graph. The computational cost of this algorithm is proportional to the number of edges in the graph, assuming that the partial derivative associated with each edge requires a constant time. This is of the same order as the number of computations for the forward propagation. Each ∂u^(i)/∂u^(j) is a function of the parents u^(j) of u^(i), thus linking the nodes of the forward graph to those added for the back-propagation graph.

Run forward propagation (Algorithm 6.1 for this example) to obtain the activations of the network.
Initialize grad_table, a data structure that will store the derivatives that have been computed. The entry grad_table[u^(i)] will store the computed value of ∂u^(n)/∂u^(i).
grad_table[u^(n)] ← 1
for j = n − 1 down to 1 do
  The next line computes ∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j)) using stored values:
  grad_table[u^(j)] ← Σ_{i : j ∈ Pa(u^(i))} grad_table[u^(i)] ∂u^(i)/∂u^(j)
end for
return {grad_table[u^(i)] | i = 1, ..., n_i}

The back-propagation algorithm is designed to reduce the number of common subexpressions without regard to memory. Specifically, it performs on the order of one Jacobian product per node in the graph. This can be seen from the fact that backprop (Algorithm 6.2) visits each edge from node u^(j) to node u^(i) of the graph exactly once in order to obtain the associated partial derivative ∂u^(i)/∂u^(j). Back-propagation thus avoids the exponential explosion in repeated subexpressions.
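The simplified back-propagation procedure above can be sketched in the same dict-based style; the graph encoding and the table of precomputed edge partials are illustrative assumptions:

```python
# A sketch of the simplified back-propagation of Algorithm 6.2, under
# illustrative assumptions: nodes are numbered 1..n in topological order,
# `parents` maps each non-input node to its parent indices, and
# `partials[(i, j)]` holds du^(i)/du^(j) evaluated after forward propagation.
def backprop(parents, partials, n, n_i):
    grad_table = {n: 1.0}                        # du^(n)/du^(n) = 1
    for j in range(n - 1, 0, -1):
        # Eq. 6.49, using stored values: sum over the children i of node j.
        grad_table[j] = sum(
            grad_table[i] * partials[(i, j)]
            for i in parents
            if j in parents[i]
        )
    return {i: grad_table[i] for i in range(1, n_i + 1)}

# Example graph: u3 = u1 * u2 and u4 = u3 + u1, evaluated at u1 = 2, u2 = 5.
parents = {3: [1, 2], 4: [3, 1]}
partials = {(3, 1): 5.0, (3, 2): 2.0,            # du3/du1 = u2, du3/du2 = u1
            (4, 3): 1.0, (4, 1): 1.0}
print(backprop(parents, partials, n=4, n_i=2))   # {1: 6.0, 2: 2.0}
```

Since u4 = u1·u2 + u1, the hand-computed gradient is (u2 + 1, u1) = (6, 2), which the table-filling loop recovers by visiting each edge exactly once.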
However, other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions. We will revisit these ideas after describing the back-propagation algorithm itself.

To clarify the above definition of the back-propagation computation, let us consider the specific graph associated with a fully-connected multi-layer MLP. Algorithm 6.3 first shows the forward propagation, which maps parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided in input. Algorithm 6.4 then shows the corresponding computation to be done for applying the back-propagation algorithm to this graph.

Algorithm 6.3 and Algorithm 6.4 are demonstrations that are chosen to be simple and straightforward to understand. However, they are specialized to one specific problem.

Modern software implementations are based on the generalized form of back-propagation described in Sec. 6.5.6 below, which can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation.

Algebraic expressions and computational graphs both operate on symbols, or variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations. When we actually use or train a neural network, we must assign specific values to these symbols. We replace a symbolic input to the network x with a specific numeric value, such as [1.2, 3.765, −1.8]^⊤.

Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach "symbol-to-number" differentiation. This is the approach used by libraries such as Torch (Collobert et al., 2011b) and Caffe (Jia, 2013).

Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives. This
Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let w ∈ R be the input to the graph. We use the same function f : R → R as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply Eq. 6.44 and obtain:

∂z/∂w    (6.50)
= (∂z/∂y)(∂y/∂x)(∂x/∂w)    (6.51)
= f'(y) f'(x) f'(w)    (6.52)
= f'(f(f(w))) f'(f(w)) f'(w)    (6.53)

Eq. 6.52 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by Eq. 6.53, where the subexpression f(w) appears more than once. In the alternative approach, f(w) is recomputed each time it is needed. When the memory required to store the value of these expressions is low, the back-propagation approach of Eq. 6.52 is clearly preferable because of its reduced runtime. However, Eq. 6.53 is also a valid implementation of the chain rule, and is useful when memory is limited.
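The trade-off in the caption can be made concrete. The choice f = exp (whose derivative is itself) is an illustrative assumption; both functions below return the same value, but the second recomputes f(w) instead of storing it:

```python
# A sketch contrasting Eq. 6.52 and Eq. 6.53 for z = f(f(f(w))),
# using an arbitrary illustrative f with derivative df.
import math
f, df = math.exp, math.exp             # f'(u) = f(u) for exp

def grad_stored(w):
    # Eq. 6.52: the forward pass stores x = f(w) and y = f(x), each computed once.
    x = f(w)
    y = f(x)
    return df(y) * df(x) * df(w)       # 3 calls to f/df in total per factor

def grad_recomputed(w):
    # Eq. 6.53: f(w) is recomputed wherever it appears; less memory, more time.
    return df(f(f(w))) * df(f(w)) * df(w)

w = 0.3
assert math.isclose(grad_stored(w), grad_recomputed(w))
```

For a chain of length k the recomputing variant performs O(k^2) evaluations of f where the storing variant performs O(k), which is the exponential-blowup argument of Fig. 6.9 in miniature.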
Algorithm 6.3  Forward propagation through a typical deep neural network and the computation of the cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y (see Sec. 6.2.1.1 for examples of loss functions). To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the parameters (weights and biases). Algorithm 6.4 shows how to compute gradients of J with respect to parameters W and b. For simplicity, this demonstration uses only a single input example x. Practical applications should use a minibatch. See Sec. 6.5.7 for a more realistic demonstration.

Require: Network depth, l
Require: W^(i), i ∈ {1, ..., l}, the weight matrices of the model
Require: b^(i), i ∈ {1, ..., l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
h^(0) = x
for k = 1, ..., l do
  a^(k) = b^(k) + W^(k) h^(k−1)
  h^(k) = f(a^(k))
end for
ŷ = h^(l)
J = L(ŷ, y) + λΩ(θ)

is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and TensorFlow (Abadi et al., 2015). An example of how this approach works is illustrated in Fig. 6.10. The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run back-propagation again, differentiating the derivatives in order to obtain higher derivatives. Computation of higher-order derivatives is described in Sec. 6.5.10.

We will use the latter approach and describe the back-propagation algorithm in terms of constructing a computational graph for the derivatives. Any subset of the graph may then be evaluated using specific numerical values at a later time. This allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its parents' values are available.

The description of the symbol-to-symbol based approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number
Algorithm 6.4  Backward computation for the deep neural network of Algorithm 6.3, which uses, in addition to the input x, a target y. This computation yields the gradients on the activations a^(k) for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.

After the forward computation, compute the gradient on the output layer:
g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
for k = l, l − 1, ..., 1 do
  Convert the gradient on the layer's output into a gradient into the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
  g ← ∇_{a^(k)} J = g ⊙ f'(a^(k))
  Compute gradients on weights and biases (including the regularization term, where needed):
  ∇_{b^(k)} J = g + λ ∇_{b^(k)} Ω(θ)
  ∇_{W^(k)} J = g h^{(k−1)⊤} + λ ∇_{W^(k)} Ω(θ)
  Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
  g ← ∇_{h^(k−1)} J = W^{(k)⊤} g
end for
Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In this approach, the back-propagation algorithm does not need to ever access any actual specific numeric values. Instead, it adds nodes to a computational graph describing how to compute these derivatives. A generic graph evaluation engine can later compute the derivatives for any specific numeric values. (Left) In this example, we begin with a graph representing z = f(f(f(w))). (Right) We run the back-propagation algorithm, instructing it to construct the graph for the expression corresponding to dz/dw. In this example, we do not explain how the back-propagation algorithm works. The purpose is only to illustrate what the desired result is: a computational graph with a symbolic description of the derivative.
approach does not expose the graph.

The back-propagation algorithm is very simple. To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z. We continue multiplying by Jacobians traveling backwards through the graph in this way until we reach x. For any node that may be reached by going backwards from z through two or more paths, we simply sum the gradients arriving from different paths at that node.

More formally, each node in the graph G corresponds to a variable. To achieve maximum generality, we describe this variable as being a tensor V. Tensors can
in general have any number of dimensions, and subsume scalars, vectors, and matrices.

We assume that each variable V is associated with the following subroutines:

• get_operation(V): This returns the operation that computes V, represented by the edges coming into V in the computational graph. For example, there may be a Python or C++ class representing the matrix multiplication operation, and the function that invokes it. Suppose we have a variable that is created by matrix multiplication, C = AB. Then get_operation(C) returns a pointer to an instance of the corresponding C++ class.

• get_consumers(V, G): This returns the list of variables that are children of V in the computational graph G.

• get_inputs(V, G): This returns the list of variables that are parents of V in the computational graph G.

Each operation is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by Eq. 6.47.
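On a toy graph representation, these subroutines might look as follows (a hedged sketch; the Node class and list-based graph are assumptions of this example, not part of the text):

```python
class Node:
    """A variable in the computational graph. `op` names the operation that
    computed it (None for leaf variables); `inputs` lists its parents."""
    def __init__(self, op=None, inputs=()):
        self.op = op
        self.inputs = list(inputs)

def get_operation(v):
    # The operation that computes v, i.e. the edges coming into v.
    return v.op

def get_inputs(v, graph):
    # The variables that are parents of v in the graph.
    return v.inputs

def get_consumers(v, graph):
    # The variables that are children of v: every node listing v as an input.
    return [n for n in graph if v in n.inputs]
```

For example, for C = AB one would build `C = Node(op="matmul", inputs=[A, B])`; then `get_operation(C)` identifies the matrix multiplication and `get_consumers(A, graph)` contains C.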
This is how the back-propagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to back-propagate through the edges in the graph that it participates in. For example, we might use a matrix multiplication operation to create a variable C = AB. Suppose that the gradient of a scalar z with respect to C is given by G. The matrix multiplication operation is responsible for defining two back-propagation rules, one for each of its input arguments. If we call
the bprop method to request the gradient with respect to A given that the gradient on the output is G, then the bprop method of the matrix multiplication operation must state that the gradient with respect to A is given by GB⊤. Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix operation is responsible for implementing the bprop method and specifying that the desired gradient is given by A⊤G. The back-propagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs, X, G) must return

    ∑_i (∇_X op.f(inputs)_i) G_i,    (6.54)

which is just an implementation of the chain rule as expressed in Eq. 6.47. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation.
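For the matrix multiplication operation, the two rules just stated might be packaged like this (a sketch; the class layout is illustrative, but the rules GB⊤ and A⊤G follow the text):

```python
import numpy as np

class MatMul:
    """Matrix multiplication C = AB together with its back-propagation rules."""
    @staticmethod
    def f(A, B):
        return A @ B

    @staticmethod
    def bprop(inputs, X, G):
        # Given the gradient G on the output C, return the gradient on input X.
        A, B = inputs
        if X is A:
            return G @ B.T   # gradient with respect to the first argument
        if X is B:
            return A.T @ G   # gradient with respect to the second argument
        raise ValueError("X is not an input of this operation")
```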
The bprop method should always pretend that all of its inputs are distinct from each other, even if they are not. For example, if the multiplication operator is passed two copies of x to compute x², the bprop method should still return x as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain 2x, which is the correct total derivative on x.

Software implementations of back-propagation usually provide both the operations and their bprop methods, so that users of deep learning software libraries are able to back-propagate through graphs built using common operations like matrix multiplication, exponents, logarithms, and so on. Software engineers who build a new implementation of back-propagation or advanced users who need to add their
own operation to an existing library must usually derive the bprop method for any new operations manually.

The back-propagation algorithm is formally described in Algorithm 6.5.

In Sec. 6.5.2, we motivated back-propagation as a strategy for avoiding computing the same subexpression in the chain rule multiple times. The naive algorithm could have exponential runtime due to these repeated subexpressions. Now that we have specified the back-propagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of very many arithmetic operations (for example, we might have a graph that treats matrix
Algorithm 6.5 The outermost skeleton of the back-propagation algorithm. This portion does simple setup and cleanup work. Most of the important work happens in the build_grad subroutine of Algorithm 6.6.

Require: T, the target set of variables whose gradients must be computed
Require: G, the computational graph
Require: z, the variable to be differentiated
  Let G′ be G pruned to contain only nodes that are ancestors of z and descendents of nodes in T.
  Initialize grad_table, a data structure associating tensors to their gradients
  grad_table[z] ← 1
  for V in T do
    build_grad(V, G, G′, grad_table)
  end for
  Return grad_table restricted to T

multiplication as a single operation). Computing a gradient in a graph with n nodes will never execute more than O(n²) operations or store the output of more than O(n²) operations. Here we are counting operations in the computational graph, not individual operations executed by the underlying hardware, so it is important to remember that the runtime of each operation may be highly variable.
For example, multiplying two matrices that each contain millions of entries might correspond to a single operation in the graph. We can see that computing the gradient requires at most O(n²) operations because the forward propagation stage will at worst execute all n nodes in the original graph (depending on which values we want to compute, we may not need to execute the entire graph). The back-propagation algorithm adds one Jacobian-vector product, which should be expressed with O(1) nodes, per edge in the original graph. Because the computational graph is a directed acyclic graph it has at most O(n²) edges. For the kinds of graphs that are commonly used in practice, the situation is even better. Most neural network cost functions are
roughly chain-structured, causing back-propagation to have O(n) cost. This is far better than the naive approach, which might need to execute exponentially many nodes. This potentially exponential cost can be seen by expanding and rewriting the recursive chain rule (Eq. 6.49) non-recursively:

    ∂u^(n)/∂u^(j) = ∑_{paths (u^(π_1), u^(π_2), ..., u^(π_t)) from π_1 = j to π_t = n} ∏_{k=2}^{t} ∂u^(π_k)/∂u^(π_{k−1}).    (6.55)

Since the number of paths from node j to node n can grow up to exponentially in the
Algorithm 6.6 The inner loop subroutine build_grad(V, G, G′, grad_table) of the back-propagation algorithm, called by the back-propagation algorithm defined in Algorithm 6.5.

Require: V, the variable whose gradient should be added to G and grad_table
Require: G, the graph to modify
Require: G′, the restriction of G to nodes that participate in the gradient
Require: grad_table, a data structure mapping nodes to their gradients
  if V is in grad_table then
    Return grad_table[V]
  end if
  i ← 1
  for C in get_consumers(V, G′) do
    op ← get_operation(C)
    D ← build_grad(C, G, G′, grad_table)
    G^(i) ← op.bprop(get_inputs(C, G′), V, D)
    i ← i + 1
  end for
  G ← ∑_i G^(i)
  grad_table[V] = G
  Insert G and the operations creating it into G
  Return G

length of these paths, the number of terms in the above sum, which is the number of such paths, can grow exponentially with the depth of the forward propagation graph. This large cost would be incurred because the same computation for ∂u^(i)/∂u^(j) would be redone many times. To avoid such recomputation, we can think of back-propagation as a table-filling algorithm that takes advantage of storing intermediate results ∂u^(n)/∂u^(i). Each node in the graph has a corresponding slot in a table to store the gradient for that node. By filling in these table entries in order, back-propagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.
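Algorithms 6.5 and 6.6 can be sketched together in a few lines. In this sketch the gradients are numeric scalars, each node stores its own bprop rules, and the pruning of G to G′ is omitted — all assumptions of this example rather than the book's exact interface:

```python
class Var:
    """A graph node: `parents` are its inputs; `bprops[i]` maps the gradient on
    this node to the gradient contribution for parent slot i."""
    def __init__(self, parents=(), bprops=()):
        self.parents = list(parents)
        self.bprops = list(bprops)
        self.consumers = []
        for p in set(self.parents):   # register once even for repeated inputs
            p.consumers.append(self)

def build_grad(v, grad_table):
    """Algorithm 6.6: compute grad_table[v], reusing already-filled entries."""
    if v in grad_table:
        return grad_table[v]
    g = 0.0
    for c in v.consumers:
        d = build_grad(c, grad_table)
        # Sum one bprop contribution per slot in which v feeds c (Eq. 6.54).
        for i, p in enumerate(c.parents):
            if p is v:
                g += c.bprops[i](d)
    grad_table[v] = g
    return g

def backprop(targets, z):
    """Algorithm 6.5: the outer skeleton, seeding dz/dz = 1."""
    grad_table = {z: 1.0}
    return [build_grad(t, grad_table) for t in targets]
```

Running this on z = f(f(w)) with f(u) = 2u fills the table once per node and returns dz/dw = 4; a node feeding two slots (such as x passed twice into a multiply) has its contributions summed, exactly the table-filling behavior described above.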
As an example, we walk through the back-propagation algorithm as it is used to train a multilayer perceptron.

Here we develop a very simple multilayer perceptron with a single hidden layer. To train this model, we will use minibatch stochastic gradient descent.
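The minibatch SGD update itself is a one-line rule applied to every parameter (a generic sketch; the learning rate argument lr is a hyperparameter of this example):

```python
import numpy as np

def sgd_step(params, grads, lr):
    """Update each parameter in place: theta <- theta - lr * g, where g is
    typically the gradient averaged over one minibatch of examples."""
    for theta, g in zip(params, grads):
        theta -= lr * g
    return params
```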
The back-propagation algorithm is used to compute the gradient of the cost on a single minibatch. Specifically, we use a minibatch of examples from the training set formatted as a design matrix X and a vector of associated class labels y. The network computes a layer of hidden features H = max{0, XW^(1)}. To simplify the presentation we do not use biases in this model. We assume that our graph language includes a relu operation that can compute max{0, Z} element-wise. The predictions of the unnormalized log probabilities over classes are then given by HW^(2). We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets y and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost J_MLE. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier. However, to make this example more realistic, we also include a regularization term. The total cost

    J = J_MLE + λ (∑_{i,j} (W^(1)_{i,j})² + ∑_{i,j} (W^(2)_{i,j})²)    (6.56)
consists of the cross-entropy and a weight decay term with coefficient λ. The computational graph is illustrated in Fig. 6.11.

The computational graph for the gradient of this example is large enough that it would be tedious to draw or to read. This demonstrates one of the benefits of the back-propagation algorithm, which is that it can automatically generate gradients that would be straightforward but tedious for a software engineer to derive manually.

We can roughly trace out the behavior of the back-propagation algorithm by looking at the forward propagation graph in Fig. 6.11. To train, we wish to compute both ∇_{W^(1)} J and ∇_{W^(2)} J. There are two different paths leading backward from J to the weights: one through the cross-entropy cost, and one through the weight decay cost. The weight decay cost is relatively simple; it will always contribute 2λW^(i) to the gradient on W^(i).
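The forward computation of this cost (Eq. 6.56) can be sketched directly (a NumPy sketch; realizing the cross_entropy operation with a numerically stabilized softmax is an assumption of this example):

```python
import numpy as np

def mlp_cost(X, y, W1, W2, lam):
    """Forward pass for the single-hidden-layer MLP without biases.

    X : (m, n_in) design matrix;  y : (m,) integer class labels.
    Returns J = J_MLE + lam * (sum(W1**2) + sum(W2**2)), Eq. 6.56.
    """
    H = np.maximum(0, X @ W1)                    # relu hidden layer, H = max{0, XW1}
    U2 = H @ W2                                  # unnormalized log probabilities
    U2 = U2 - U2.max(axis=1, keepdims=True)      # stabilize the softmax
    log_q = U2 - np.log(np.exp(U2).sum(axis=1, keepdims=True))
    J_mle = -log_q[np.arange(len(y)), y].mean()  # cross-entropy over the minibatch
    return J_mle + lam * ((W1 ** 2).sum() + (W2 ** 2).sum())
```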
The other path through the cross-entropy cost is slightly more complicated. Let G be the gradient on the unnormalized log probabilities U^(2) provided by the cross_entropy operation. The back-propagation algorithm now needs to explore two different branches. On the shorter branch, it adds H⊤G to the gradient on W^(2), using the back-propagation rule for the second argument to the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes ∇_H J = GW^(2)⊤ using the back-propagation rule for the first argument to the matrix multiplication operation. Next, the relu operation uses its back-
Figure 6.11: The computational graph used to compute the cost used to train our example of a single-layer MLP using the cross-entropy loss and weight decay.
propagation rule to zero out components of the gradient corresponding to entries of U^(1) that were less than 0. Let the result be called G′. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add X⊤G′ to the gradient on W^(1).

After these gradients have been computed, it is the responsibility of the gradient descent algorithm, or another optimization algorithm, to use these gradients to update the parameters.

For the MLP, the computational cost is dominated by the cost of matrix multiplication. During the forward propagation stage, we multiply by each weight matrix, resulting in O(w) multiply-adds, where w is the number of weights. During the backward propagation stage, we multiply by the transpose of each weight matrix, which has the same computational cost. The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.
This value is stored from the time it is computed until the backward pass has returned to the same point. The memory cost is thus O(mn_h), where m is the number of examples in the minibatch and n_h is the number of hidden units.
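The gradient trace above (H⊤G on the short branch; GW^(2)⊤, the relu zeroing, and X⊤G′ on the long branch; plus the 2λW terms) can be written out explicitly. This sketch assumes a softmax cross-entropy for the cross_entropy operation, which makes the gradient G on the unnormalized log probabilities (q − p)/m for the minibatch mean:

```python
import numpy as np

def mlp_grads(X, y, W1, W2, lam):
    """Gradients of the cost in Eq. 6.56 via the two backward paths in the text."""
    m = len(y)
    U1 = X @ W1
    H = np.maximum(0, U1)
    U2 = H @ W2
    # G: gradient on the unnormalized log probabilities (softmax CE assumed).
    q = np.exp(U2 - U2.max(axis=1, keepdims=True))
    q /= q.sum(axis=1, keepdims=True)
    p = np.zeros_like(q)
    p[np.arange(m), y] = 1.0
    G = (q - p) / m
    # Shorter branch: H^T G onto W2; weight decay contributes 2*lam*W2.
    dW2 = H.T @ G + 2 * lam * W2
    # Longer branch: G W2^T, zeroed where U1 < 0, then X^T G' onto W1.
    Gp = (G @ W2.T) * (U1 > 0)
    dW1 = X.T @ Gp + 2 * lam * W1
    return dW1, dW2
```

A finite-difference check against the forward-pass cost is a convenient way to confirm both branches were combined correctly.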
Our description of the back-propagation algorithm here is simpler than the implementations actually used in practice.

As noted above, we have restricted the definition of an operation to be a function that returns a single tensor. Most software implementations need to support operations that can return more than one tensor. For example, if we wish to compute both the maximum value in a tensor and the index of that value, it is best to compute both in a single pass through memory, so it is most efficient to implement this procedure as a single operation with two outputs.

We have not described how to control the memory consumption of back-propagation. Back-propagation often involves summation of many tensors together.
all of them would b e added in often a second step.summation The naiv naiveeofapproac approach h has an overly In the naive approac h, each of these tensors would b e computed separately , then high memory b ottleneck that can b e avoided by main maintaining taining a single buffer and all of them would b e added in a second step. The naiv e approac h has an o verly adding each value to that buffer as it is computed. high memory b ottleneck that can b e avoided by maintaining a single buffer and Real-w Real-world orld implementations adding each value to that buffer of as back-propagation it is computed. also need to handle various data types, such as 32-bit floating p oint, 64-bit floating p oin oint, t, and integer values. Real-w orld implementations of back-propagation also need handle various The p olicy for handling each of these typ ypes es tak takes es sp special ecial care totodesign. data types, such as 32-bit floating p oint, 64-bit floating p oint, and integer values. Some op operations erations ha have ve undefined gradients, and it is imp important ortant to track these The p olicy for handling each of these typ es takes sp ecial care to design. cases and determine whether the gradient requested by the user is undefined. Some op erations have undefined gradients, and it is imp ortant to track these Various other technicalities make real-world differentiation more complicated. cases and determine whether the gradient requested by the user is undefined. These technicalities are not insurmoun insurmountable, table, and this chapter has describ described ed the key V arious other technicalities make real-world differentiation more complicated. in intellectual tellectual to tools ols needed to compute deriv derivatives, atives, but it is imp important ortant to b e aw aware are These technicalities are not insurmoun table, and this chapter has describ ed the k ey that many more subtleties exist. 
intellectual tools needed to compute derivatives, but it is important to be aware that many more subtleties exist.

The deep learning community has been somewhat isolated from the broader computer science community and has largely developed its own cultural attitudes concerning how to perform differentiation. More generally, the field of automatic differentiation is concerned with how to compute derivatives algorithmically. The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation. Other approaches evaluate the subexpressions of the chain rule in different orders. In general, determining the order of evaluation that results in the lowest computational cost is a difficult problem. Finding the optimal sequence of operations to compute the gradient is NP-complete (Naumann, 2008), in the
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
sense that it may require simplifying algebraic expressions into their least expensive form.

For example, suppose we have variables p1, p2, ..., pn representing probabilities and variables z1, z2, ..., zn representing unnormalized log probabilities. Suppose we define

    q_i = exp(z_i) / Σ_i exp(z_i),    (6.57)

where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss J = −Σ_i p_i log q_i. A human mathematician can observe that the derivative of J with respect to z_i takes a very simple form: q_i − p_i. The back-propagation algorithm is not capable of simplifying the gradient this way, and will instead explicitly propagate gradients through all of the logarithm and exponentiation operations in the original graph. Some software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012) are able to
perform some kinds of algebraic substitution to improve over the graph proposed by the pure back-propagation algorithm.

When the forward graph G has a single output node and each partial derivative ∂u^(i)/∂u^(j) can be computed with a constant amount of computation, back-propagation guarantees that the number of computations for the gradient computation is of the same order as the number of computations for the forward computation: this can be seen in Algorithm 6.2 because each local partial derivative ∂u^(i)/∂u^(j) needs to be computed only once along with an associated multiplication and addition for the recursive chain-rule formulation (Eq. 6.49). The overall computation is therefore O(# edges). However, it can potentially be reduced by simplifying the computational graph constructed by back-propagation, and this is an NP-complete task.
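The claim above that the cross-entropy-of-softmax gradient simplifies to q_i − p_i is easy to verify numerically. The sketch below (assuming NumPy; the function names are illustrative, not from any particular library) compares the simplified form against central finite differences taken through the composed exp/sum/log graph:

```python
import numpy as np

def softmax(z):
    # shift by the max for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(p, z):
    # J = -sum_i p_i log q_i, with q = softmax(z) built from exp, sum and divide
    return -np.sum(p * np.log(softmax(z)))

def simplified_grad(p, z):
    # the form a human mathematician derives (valid when p sums to 1):
    # dJ/dz_i = q_i - p_i
    return softmax(z) - p

def numeric_grad(p, z, eps=1e-6):
    # central finite differences, standing in for explicit propagation
    # through every log and exp node of the original graph
    g = np.zeros_like(z)
    for i in range(z.size):
        d = np.zeros_like(z)
        d[i] = eps
        g[i] = (cross_entropy(p, z + d) - cross_entropy(p, z - d)) / (2 * eps)
    return g

z = np.array([1.0, -0.5, 2.0])
p = np.array([0.2, 0.3, 0.5])   # a valid probability vector (sums to 1)
```

Both paths compute the same gradient; the simplified form simply avoids visiting every intermediate node, which is exactly the kind of rewrite a library must discover by algebraic substitution.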
Implementations such as Theano and TensorFlow use heuristics based on matching known simplification patterns in order to iteratively attempt to simplify the graph.

We defined back-propagation only for the computation of a gradient of a scalar output, but back-propagation can be extended to compute a Jacobian (either of k different scalar nodes in the graph, or of a tensor-valued node containing k values). A naive implementation may then need k times more computation: for each scalar internal node in the original forward graph, the naive implementation computes k gradients instead of a single gradient. When the number of outputs
of the graph is larger than the number of inputs, it is sometimes preferable to use another form of automatic differentiation called forward mode accumulation. Forward mode computation has been proposed for obtaining real-time computation of gradients in recurrent networks, for example (Williams and Zipser, 1989). This also avoids the need to store the values and gradients for the whole graph, trading off computational efficiency for memory. The relationship between forward mode
and backward mode is analogous to the relationship between left-multiplying versus right-multiplying a sequence of matrices, such as

    ABCD,    (6.58)

where the matrices can be thought of as Jacobian matrices. For example, if D is a column vector while A has many rows, this corresponds to a graph with a single output and many inputs, and starting the multiplications from the end and going backwards only requires matrix-vector products. This corresponds to the backward mode. Instead, starting to multiply from the left would involve a series of matrix-matrix products, which makes the whole computation much more expensive. However, if A has fewer rows than D has columns, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode.
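The cost asymmetry between the two association orders can be made concrete with a small operation-counting sketch (the helper functions are hypothetical, written for illustration). With A, B, C taken as n × n Jacobians and D an n × 1 column vector, reducing the chain from the end needs only three matrix-vector products, while reducing it from the left pays for two full matrix-matrix products first:

```python
def cost_left_to_right(shapes):
    """Multiply-add count for reducing a Jacobian chain front-to-back
    (forward mode). shapes is a list of (rows, cols) pairs."""
    total = 0
    rows, cols = shapes[0]
    for r, c in shapes[1:]:
        assert cols == r, "inner dimensions must match"
        total += rows * r * c    # cost of (rows x r) @ (r x c)
        cols = c
    return total

def cost_right_to_left(shapes):
    """Multiply-add count for reducing the same chain back-to-front
    (backward mode, as in back-propagation)."""
    total = 0
    rows, cols = shapes[-1]
    for r, c in reversed(shapes[:-1]):
        assert c == rows, "inner dimensions must match"
        total += r * c * cols    # cost of (r x c) @ (c x cols)
        rows = r
    return total

# A, B, C are n x n Jacobians; D is an n x 1 column vector,
# matching the ABCD example in the text
n = 1000
shapes = [(n, n), (n, n), (n, n), (n, 1)]
```

For n = 1000 the right-to-left order costs 3n² multiply-adds versus 2n³ + n² for left-to-right, a factor of roughly n in this sketch.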
In many communities outside of machine learning, it is more common to implement differentiation software that acts directly on traditional programming language code, such as Python or C code, and automatically generates programs that differentiate functions written in these languages. In the deep learning community, computational graphs are usually represented by explicit data structures created by specialized libraries. The specialized approach has the drawback of requiring the library developer to define the methods for every operation and limiting the user of the library to only those operations that have been defined. However, the specialized approach also has the benefit of allowing customized back-propagation rules to be developed for each operation, allowing the developer to improve speed or stability in non-obvious ways that an automatic procedure would presumably be unable to replicate.
Back-propagation is therefore not the only way or the optimal way of computing the gradient, but it is a very practical method that continues to serve the deep learning community very well. In the future, differentiation technology for deep networks may improve as deep learning practitioners become more aware of advances in the broader field of automatic differentiation.

Some software frameworks support the use of higher-order derivatives. Among the deep learning software frameworks, this includes at least Theano and TensorFlow. These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This
means that the symbolic differentiation machinery can be applied to derivatives.

In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian
matrix. If we have a function f : R^n → R, then the Hessian matrix is of size n × n. In typical deep learning applications, n will be the number of parameters in the model, which could easily number in the billions. The entire Hessian matrix is thus infeasible to even represent.

Instead of explicitly computing the Hessian, the typical deep learning approach is to use Krylov methods. Krylov methods are a set of iterative techniques for performing various operations like approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products.

In order to use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix H and an arbitrary vector v. A straightforward technique (Christianson, 1992) for doing so is to compute

    Hv = ∇_x [ (∇_x f(x))^⊤ v ].    (6.59)
Both of the gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression takes the gradient of a function of the inner gradient expression.

If v is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced v.

While computing the Hessian is usually not advisable, it is possible to do with Hessian-vector products. One simply computes He^(i) for all i = 1, ..., n, where e^(i) is the one-hot vector with e_i^(i) = 1 and all other entries equal to 0.
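Equation 6.59 is normally implemented with two automatic-differentiation passes. As a dependency-light sketch (assuming NumPy; the quadratic f and the finite-difference outer gradient are illustrative stand-ins for the library's autodiff machinery), we can check the identity on f(x) = ½ xᵀAx, whose Hessian is known in closed form, and also recover one full Hessian column with a one-hot vector as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
H = 0.5 * (A + A.T)            # exact Hessian of f(x) = 0.5 * x^T A x

def grad_f(x):
    # analytic gradient of f; in practice this inner gradient would come
    # from the library's first differentiation pass
    return 0.5 * (A + A.T) @ x

def hessian_vector_product(x, v, eps=1e-6):
    # Hv = grad_x[(grad_x f(x))^T v]; the outer gradient is approximated
    # here by central differences instead of a second autodiff pass
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

x = rng.standard_normal(n)
v = rng.standard_normal(n)

# recovering the first Hessian column with the one-hot vector e^(1)
e0 = np.zeros(n)
e0[0] = 1.0
```

The key property is that only matrix-vector products with H are ever formed; the n × n Hessian itself never needs to be materialized, which is what makes Krylov methods applicable at deep learning scale.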
6.6 Historical Notes
Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation. From this point of view, the modern feedforward network is the culmination of centuries of progress on the general function approximation task.

The chain rule that underlies the back-propagation algorithm was invented in the 17th century (Leibniz, 1676; L’Hôpital, 1696). Calculus and algebra have long been used to solve optimization problems in closed form, but gradient descent was not introduced as a technique for iteratively approximating the solution to optimization problems until the 19th century (Cauchy, 1847).

Beginning in the 1940s, these function approximation techniques were used to motivate machine learning models such as the perceptron. However, the earliest
models were based on linear models. Critics including Marvin Minsky pointed out several of the flaws of the linear model family, such as its inability to learn the XOR function, which led to a backlash against the entire neural network approach.

Learning nonlinear functions required the development of a multilayer perceptron and a means of computing the gradient through such a model. Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for sensitivity analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these techniques to training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a).
The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks. However, the ideas put forward by the authors of that book, and in particular by Rumelhart and Hinton, go much beyond back-propagation. They include crucial ideas about the possible computational implementation of several central aspects of cognition and learning, which came under the name of “connectionism” because of the importance given the connections between neurons as the locus of learning and memory.
In particular, these ideas include the notion of distributed representation (Hinton et al., 1986).

Following the success of back-propagation, neural network research gained popularity and reached a peak in the early 1990s. Afterwards, other machine learning techniques became more popular until the modern deep learning renaissance that began in 2006.

The core ideas behind modern feedforward networks have not changed substantially since the 1980s. The same back-propagation algorithm and the same approaches to gradient descent are still in use. Most of the improvement in neural network performance from 1986 to 2015 can be attributed to two factors. First, larger datasets have reduced the degree to which statistical generalization is a challenge for neural networks. Second, neural networks have become much larger, due to more powerful
computers and better software infrastructure. However, a small number of algorithmic changes have improved the performance of neural networks noticeably.

One of these algorithmic changes was the replacement of mean squared error with the cross-entropy family of loss functions. Mean squared error was popular in the 1980s and 1990s, but was gradually replaced by cross-entropy losses and the
principle of maximum likelihood as ideas spread between the statistics community and the machine learning community. The use of cross-entropy losses greatly improved the performance of models with sigmoid and softmax outputs, which had previously suffered from saturation and slow learning when using the mean squared error loss.

The other major algorithmic change that has greatly improved the performance of feedforward networks was the replacement of sigmoid hidden units with piecewise linear hidden units, such as rectified linear units. Rectification using the max{0, z} function was introduced in early neural network models and dates back at least as far as the Cognitron and Neocognitron (Fukushima, 1975, 1980). These early models did not use rectified linear units, but instead applied rectification to nonlinear functions.
Despite the early popularity of rectification, rectification was largely replaced by sigmoids in the 1980s, perhaps because sigmoids perform better when neural networks are very small. As of the early 2000s, rectified linear units were avoided due to a somewhat superstitious belief that activation functions with non-differentiable points must be avoided. This began to change in about 2009. Jarrett et al. (2009) observed that “using a rectifying nonlinearity is the single most important factor in improving the performance of a recognition system” among several different factors of neural network architecture design.

For small datasets, Jarrett et al. (2009) observed that using rectifying nonlinearities is even more important than learning the weights of the hidden layers. Random weights are sufficient to propagate useful information through a rectified
linear network, allowing the classifier layer at the top to learn how to map different feature vectors to class identities.

When more data is available, learning begins to extract enough useful knowledge to exceed the performance of randomly chosen parameters. Glorot et al. (2011a) showed that learning is far easier in deep rectified linear networks than in deep networks that have curvature or two-sided saturation in their activation functions.

Rectified linear units are also of historical interest because they show that neuroscience has continued to have an influence on the development of deep learning algorithms. Glorot et al. (2011a) motivate rectified linear units from biological considerations. The half-rectifying nonlinearity was intended to capture these properties of biological neurons: 1) For some inputs, biological neurons are completely inactive.
2) For some inputs, a biological neuron’s output is proportional to its input. 3) Most of the time, biological neurons operate in the regime where they are inactive (i.e., they should have sparse activations).

When the modern resurgence of deep learning began in 2006, feedforward networks continued to have a bad reputation. From about 2006-2012, it was widely
believed that feedforward networks would not perform well unless they were assisted by other models, such as probabilistic models. Today, it is now known that with the right resources and engineering practices, feedforward networks perform very well. Today, gradient-based learning in feedforward networks is used as a tool to develop probabilistic models, such as the variational autoencoder and generative adversarial networks, described in Chapter 20. Rather than being viewed as an unreliable technology that must be supported by other techniques, gradient-based learning in feedforward networks has been viewed since 2012 as a powerful technology that may be applied to many other machine learning tasks. In 2006, the community used unsupervised learning to support supervised learning, and now, ironically, it
community is to many use sup supervised ervised to supp support ort unsup unsupervised ervised used unsup ervised learning to supp ort sup ervised learning, and now, ironically, it Feedforward netw networks orks contin continue ue to ha have ve unfulfilled p otential. In the future, we is more common to use sup ervised learning to supp ort unsup ervised learning. exp expect ect they will b e applied to man many y more tasks, and that adv advances ances in optimization Feedforward netw contin ue improv to havee unfulfilled p otential.even In the future, we algorithms and mo model delorks design will improve their p erformance further. This ect they b e applied to man more tasks, andork that advances in dels. optimization cexp hapter has will primarily describ described ed ythe neural netw network family of mo models. In the algorithms and mo del design will improv e their p erformance even further. subsequen subsequentt chapters, we turn to how to use these mo models—ho dels—ho dels—how w to regularizeThis and ctrain hapter has primarily describ ed the neural netw ork family of mo dels. In the them. subsequent chapters, we turn to how to use these mo dels—how to regularize and train them.
Chapter 7
Regularization for Deep Learning

A central problem in machine learning is how to make an algorithm that will perform well not just on the training data, but also on new inputs. Many strategies used in machine learning are explicitly designed to reduce the test error, possibly at the expense of increased training error. These strategies are known collectively as regularization. As we will see, there are a great many forms of regularization available to the deep learning practitioner. In fact, developing more effective regularization strategies has been one of the major research efforts in the field.

Chapter 5 introduced the basic concepts of generalization, underfitting, overfitting, bias, variance and regularization. If you are not already familiar with these notions, please refer to that chapter before continuing with this one.

In this chapter, we describe regularization in more detail, focusing on regularization strategies for deep models or models that may be used as building blocks to form deep models.

Some sections of this chapter deal with standard concepts in machine learning. If you are already familiar with these concepts, feel free to skip the relevant sections. However, most of this chapter is concerned with the extension of these basic concepts to the particular case of neural networks.

In Sec. 5.2.2, we defined regularization as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error." There are many regularization strategies. Some put extra constraints on a machine learning model, such as adding restrictions on the parameter values. Some add extra terms in the objective function that can be thought of as corresponding to a soft constraint on the parameter values. If chosen carefully, these extra constraints and penalties can lead to improved performance
on the test set. Sometimes these constraints and penalties are designed to encode specific kinds of prior knowledge. Other times, these constraints and penalties are designed to express a generic preference for a simpler model class in order to promote generalization. Sometimes penalties and constraints are necessary to make an underdetermined problem determined. Other forms of regularization, known as ensemble methods, combine multiple hypotheses that explain the training data.

In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias. When we discussed generalization and overfitting in Chapter 5, we focused on three situations, where the model family being trained either (1) excluded the true data generating process—corresponding to underfitting and inducing bias, or (2) matched the true data generating process, or (3) included the generating process but also many other possible generating processes—the overfitting regime where variance rather than bias dominates the estimation error. The goal of regularization is to take a model from the third regime into the second regime.

In practice, an overly complex model family does not necessarily include the target function or the true data generating process, or even a close approximation of either. We almost never have access to the true data generating process so we can never know for sure if the model family being estimated includes the generating process or not. However, most applications of deep learning algorithms are to domains where the true data generating process is almost certainly outside the model family. Deep learning algorithms are typically applied to extremely complicated domains such as images, audio sequences and text, for which the true generation process essentially involves simulating the entire universe. To some extent, we are always trying to fit a square peg (the data generating process) into a round hole (our model family).

What this means is that controlling the complexity of the model is not a simple matter of finding the model of the right size, with the right number of parameters. Instead, we might find—and indeed in practical deep learning scenarios, we almost always do find—that the best fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately.

We now review several strategies for how to create such a large, deep, regularized model.
7.1 Parameter Norm Penalties
Regularization has been used for decades prior to the advent of deep learning. Linear models such as linear regression and logistic regression allow simple, straightforward, and effective regularization strategies.

Many regularization approaches are based on limiting the capacity of models, such as neural networks, linear regression, or logistic regression, by adding a parameter norm penalty Ω(θ) to the objective function J. We denote the regularized objective function by J̃:

    J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)    (7.1)

where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution of the norm penalty term, Ω, relative to the standard objective function J(x; θ). Setting α to 0 results in no regularization. Larger values of α correspond to more regularization.

When our training algorithm minimizes the regularized objective function J̃ it will decrease both the original objective J on the training data and some measure of the size of the parameters θ (or some subset of the parameters). Different choices for the parameter norm Ω can result in different solutions being preferred. In this section, we discuss the effects of the various norms when used as penalties on the model parameters.

Before delving into the regularization behavior of different norms, we note that for neural networks, we typically choose to use a parameter norm penalty Ω that penalizes only the weights of the affine transformation at each layer and leaves the biases unregularized. The biases typically require less data to fit accurately than the weights. Each weight specifies how two variables interact. Fitting the weight well requires observing both variables in a variety of conditions. Each bias controls only a single variable. This means that we do not induce too much variance by leaving the biases unregularized. Also, regularizing the bias parameters can introduce a significant amount of underfitting. We therefore use the vector w to indicate all of the weights that should be affected by a norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters.

In the context of neural networks, it is sometimes desirable to use a separate penalty with a different α coefficient for each layer of the network. Because it can be expensive to search for the correct value of multiple hyperparameters, it is still reasonable to use the same weight decay at all layers just to reduce the search space.
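The convention of penalizing only the weights can be made concrete in a small sketch. This is an illustrative fragment of our own (the model, data, and function names are not from the book): a linear model with squared error whose penalty Ω is applied to the weight vector w but not to the bias b.

```python
import numpy as np

# Illustrative sketch: regularized objective J~ = J + alpha * Omega(theta),
# where Omega penalizes only the weights w, leaving the bias b unregularized.

def regularized_objective(w, b, X, y, alpha):
    residual = X @ w + b - y
    J = 0.5 * np.mean(residual ** 2)   # original objective J
    omega = 0.5 * np.dot(w, w)         # Omega(theta) = (1/2)||w||^2, bias excluded
    return J + alpha * omega           # J~ = J + alpha * Omega

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w, b = np.zeros(3), 0.0
print(regularized_objective(w, b, X, y, alpha=0.1))
```

Setting alpha to 0 recovers the unregularized objective J, matching Eq. 7.1.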
7.1.1 L2 Parameter Regularization

We have already seen, in Sec. 5.2.2, one of the simplest and most common kinds
of parameter norm penalty: the L2 parameter norm penalty commonly known as weight decay.¹ This regularization strategy drives the weights closer to the origin by adding a regularization term Ω(θ) = (1/2)||w||²₂ to the objective function. In other academic communities, L2 regularization is also known as ridge regression or Tikhonov regularization.

We can gain some insight into the behavior of weight decay regularization by studying the gradient of the regularized objective function. To simplify the presentation, we assume no bias parameter, so θ is just w. Such a model has the following total objective function:

    J̃(w; X, y) = (α/2)wᵀw + J(w; X, y),    (7.2)

with the corresponding parameter gradient

    ∇_w J̃(w; X, y) = αw + ∇_w J(w; X, y).    (7.3)

To take a single gradient step (with learning rate ε) to update the weights, we perform this update:

    w ← w − ε(αw + ∇_w J(w; X, y)).    (7.4)

Written another way, the update is:

    w ← (1 − εα)w − ε∇_w J(w; X, y).    (7.5)
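This single-step weight decay update can be iterated numerically. The following is an illustrative sketch with toy values of our own choosing (not code from the book), applied to a simple quadratic objective:

```python
import numpy as np

# Illustrative sketch of the weight-decay gradient step:
# w <- (1 - eps * alpha) * w - eps * grad_J(w)

def weight_decay_step(w, grad_J, eps, alpha):
    return (1.0 - eps * alpha) * w - eps * grad_J(w)

# Toy quadratic objective J(w) = 0.5 * (w - w_star)^T H (w - w_star)
H = np.diag([10.0, 0.1])
w_star = np.array([1.0, 1.0])
grad_J = lambda w: H @ (w - w_star)

w = np.zeros(2)
for _ in range(10000):
    w = weight_decay_step(w, grad_J, eps=0.05, alpha=0.1)
print(w)  # approx [0.990, 0.5]: the low-curvature component is shrunk strongly
```

At convergence, each component of w* is shrunk toward zero by an amount that depends on the curvature of J along that axis.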
We can see that the addition of the weight decay term has modified the learning rule to multiplicatively shrink the weight vector by a constant factor on each step, just before performing the usual gradient update. This describes what happens in a single step. But what happens over the entire course of training?

We will further simplify the analysis by making a quadratic approximation to the objective function in the neighborhood of the value of the weights that obtains minimal unregularized training cost, w* = arg min_w J(w). If the objective function is truly quadratic, as in the case of fitting a linear regression model with mean squared error, then the approximation is perfect.

    Ĵ(θ) = J(w*) + (1/2)(w − w*)ᵀH(w − w*)    (7.6)

¹ More generally, we could regularize the parameters to be near any specific point in space and, surprisingly, still get a regularization effect, but better results will be obtained for a value closer to the true one, with zero being a default value that makes sense when we do not know if the correct value should be positive or negative. Since it is far more common to regularize the model parameters towards zero, we will focus on this special case in our exposition.
where H is the Hessian matrix of J with respect to w evaluated at w*. There is no first-order term in this quadratic approximation, because w* is defined to be a minimum, where the gradient vanishes. Likewise, because w* is the location of a minimum of J, we can conclude that H is positive semidefinite.

The minimum of Ĵ occurs where its gradient

    ∇_w Ĵ(w) = H(w − w*)    (7.7)

is equal to 0.
To study the effect of weight decay, we modify Eq. 7.7 by adding the weight decay gradient. We can now solve for the minimum of the regularized version of Ĵ. We use the variable w̃ to represent the location of the minimum.

    αw̃ + H(w̃ − w*) = 0    (7.8)
    (H + αI)w̃ = Hw*    (7.9)
    w̃ = (H + αI)⁻¹Hw*.    (7.10)
As α approaches 0, the regularized solution w̃ approaches w*. But what happens as α grows? Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an orthonormal basis of eigenvectors, Q, such that H = QΛQᵀ. Applying the decomposition to Eq. 7.10, we obtain:

    w̃ = (QΛQᵀ + αI)⁻¹QΛQᵀw*    (7.11)
      = [Q(Λ + αI)Qᵀ]⁻¹QΛQᵀw*    (7.12)
      = Q(Λ + αI)⁻¹ΛQᵀw*.    (7.13)

We see that the effect of weight decay is to rescale w* along the axes defined by the eigenvectors of H. Specifically, the component of w* that is aligned with the i-th eigenvector of H is rescaled by a factor of λ_i/(λ_i + α). (You may wish to review how this kind of scaling works, first explained in Fig. 2.3.)

Along the directions where the eigenvalues of H are relatively large, for example, where λ_i ≫ α, the effect of regularization is relatively small. However, components with λ_i ≪ α will be shrunk to have nearly zero magnitude. This effect is illustrated in Fig. 7.1.

Only directions along which the parameters contribute significantly to reducing the objective function are preserved relatively intact. In directions that do not contribute to reducing the objective function, a small eigenvalue of the Hessian tells us that movement in this direction will not significantly increase the gradient.
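This rescaling by λ_i/(λ_i + α) is easy to check numerically. The following sketch (an illustrative fragment of our own, not from the book) verifies that the eigendecomposition form of Eq. 7.13 agrees with the direct solution of Eq. 7.10:

```python
import numpy as np

# Illustrative check: for a symmetric positive definite Hessian H,
# w~ = (H + alpha*I)^{-1} H w*  equals  Q (Lambda + alpha*I)^{-1} Lambda Q^T w*,
# i.e. each eigencomponent of w* is rescaled by lambda_i / (lambda_i + alpha).

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + np.eye(4)          # symmetric positive definite Hessian
w_star = rng.normal(size=4)
alpha = 0.5

# Direct solution (Eq. 7.10)
w_tilde = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Eigendecomposition form (Eq. 7.13)
lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ ((lam / (lam + alpha)) * (Q.T @ w_star))

print(np.allclose(w_tilde, w_tilde_eig))  # True
```

Each rescaling factor lam / (lam + alpha) lies strictly between 0 and 1, so every eigencomponent is shrunk, and components with small eigenvalues are shrunk the most.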
[Figure 7.1: contour plot in the (w1, w2) plane showing the unregularized optimum w* and the regularized solution w̃.]

Figure 7.1: An illustration of the effect of L2 (or weight decay) regularization on the value of the optimal w. The solid ellipses represent contours of equal value of the unregularized objective. The dotted circles represent contours of equal value of the L2 regularizer. At the point w̃, these competing objectives reach an equilibrium. In the first dimension, the eigenvalue of the Hessian of J is small. The objective function does not increase much when moving horizontally away from w*. Because the objective function does not express a strong preference along this direction, the regularizer has a strong effect on this axis. The regularizer pulls w1 close to zero. In the second dimension, the objective function is very sensitive to movements away from w*. The corresponding eigenvalue is large, indicating high curvature. As a result, weight decay affects the position of w2 relatively little.
Components of the weight vector corresponding to such unimportant directions are decayed away through the use of the regularization throughout training.

So far we have discussed weight decay in terms of its effect on the optimization of an abstract, general, quadratic cost function. How do these effects relate to machine learning in particular? We can find out by studying linear regression, a model for which the true cost function is quadratic and therefore amenable to the same kind of analysis we have used so far. Applying the analysis again, we will be able to obtain a special case of the same results, but with the solution now phrased in terms of the training data. For linear regression, the cost function is the sum of squared errors:

    (Xw − y)ᵀ(Xw − y).    (7.14)

When we add L2 regularization, the objective function changes to

    (Xw − y)ᵀ(Xw − y) + (1/2)αwᵀw.    (7.15)

This changes the normal equations for the solution from

    w = (XᵀX)⁻¹Xᵀy    (7.16)

to

    w = (XᵀX + αI)⁻¹Xᵀy.    (7.17)

The matrix XᵀX in Eq. 7.16 is proportional to the covariance matrix (1/m)XᵀX. Using L2 regularization replaces this matrix with (XᵀX + αI)⁻¹ in Eq. 7.17. The new matrix is the same as the original one, but with the addition of α to the diagonal. The diagonal entries of this matrix correspond to the variance of each input feature. We can see that L2 regularization causes the learning algorithm to "perceive" the input X as having higher variance, which makes it shrink the weights on features whose covariance with the output target is low compared to this added variance.
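The closed-form solutions in Eqs. 7.16 and 7.17 can be sketched directly in NumPy. This is an illustrative fragment with toy data of our own (not from the book):

```python
import numpy as np

# Illustrative comparison of the ordinary least squares solution (Eq. 7.16)
# with the weight-decay / ridge solution (Eq. 7.17) on toy data.

rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

def ols(X, y):
    # w = (X^T X)^{-1} X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, alpha):
    # w = (X^T X + alpha I)^{-1} X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

w_ols = ols(X, y)
w_ridge = ridge(X, y, alpha=10.0)
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # regularization shrinks w
```

Because each eigencomponent of the solution is scaled by a factor strictly less than one, the norm of the regularized solution is always smaller than the norm of the unregularized one.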
7.1.2 L1 Regularization

While L2 weight decay is the most common form of weight decay, there are other ways to penalize the size of the model parameters. Another option is to use L1 regularization.

Formally, L1 regularization on the model parameter w is defined as:

    Ω(θ) = ||w||_1 = Σ_i |w_i|,    (7.18)
that is, as the sum of absolute values of the individual parameters.² We will now discuss the effect of L¹ regularization on the simple linear regression model, with no bias parameter, that we studied in our analysis of L² regularization. In particular, we are interested in delineating the differences between the L¹ and L² forms of regularization. As with L² weight decay, L¹ weight decay controls the strength of the regularization by scaling the penalty Ω using a positive hyperparameter α. Thus, the regularized objective function J̃(w; X, y) is given by

J̃(w; X, y) = α||w||_1 + J(w; X, y),   (7.19)

with the corresponding gradient (actually, sub-gradient):

∇_w J̃(w; X, y) = α sign(w) + ∇_w J(X, y; w),   (7.20)

where sign(w) is simply the sign of w applied element-wise.

By inspecting Eq. 7.20, we can see immediately that the effect of L¹ regularization is quite different from that of L² regularization. Specifically, we can see that the regularization contribution to the gradient no longer scales linearly with each w_i; instead it is a constant factor with a sign equal to sign(w_i). One consequence of this form of the gradient is that we will not necessarily see clean algebraic solutions to quadratic approximations of J(X, y; w) as we did for L² regularization.

Our simple linear model has a quadratic cost function that we can represent via its Taylor series. Alternatively, we could imagine that this is a truncated Taylor series approximating the cost function of a more sophisticated model.
The gradient in this setting is given by

∇_w Ĵ(w) = H(w − w*),   (7.21)

where, again, H is the Hessian matrix of J with respect to w evaluated at w*.

Because the L¹ penalty does not admit clean algebraic expressions in the case of a fully general Hessian, we will also make the further simplifying assumption that the Hessian is diagonal, H = diag([H_{1,1}, . . . , H_{n,n}]), where each H_{i,i} > 0. This assumption holds if the data for the linear regression problem has been preprocessed to remove all correlation between the input features, which may be accomplished using PCA.

² As with L² regularization, we could regularize the parameters towards a value that is not zero, but instead towards some parameter value w^(o). In that case the L¹ regularization would introduce the term Ω(θ) = ||w − w^(o)||_1 = ∑_i |w_i − w_i^(o)|.
Our quadratic approximation of the L¹ regularized objective function decomposes into a sum over the parameters:

Ĵ(w; X, y) = J(w*; X, y) + ∑_i [ (1/2) H_{i,i} (w_i − w*_i)² + α|w_i| ].   (7.22)

The problem of minimizing this approximate cost function has an analytical solution (for each dimension i), with the following form:

w_i = sign(w*_i) max( |w*_i| − α/H_{i,i}, 0 ).   (7.23)

Consider the situation where w*_i > 0 for all i. There are two possible outcomes:

1. w*_i ≤ α/H_{i,i}. Here the optimal value of w_i under the regularized objective is simply w_i = 0. This occurs because the contribution of J(w; X, y) to the regularized objective J̃(w; X, y) is overwhelmed, in direction i, by the L¹ regularization, which pushes the value of w_i to zero.

2. w*_i > α/H_{i,i}. Here the regularization does not move the optimal value of w_i to zero, but instead just shifts it in that direction by a distance equal to α/H_{i,i}.
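Both outcomes above follow from Eq. 7.23, which is the familiar soft-thresholding operation. A minimal NumPy sketch, with made-up example values for w*, the diagonal Hessian entries, and α:

```python
import numpy as np

def l1_quadratic_minimizer(w_star, h_diag, alpha):
    # Per-dimension solution of Eq. 7.23:
    # w_i = sign(w*_i) * max(|w*_i| - alpha / H_ii, 0)
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / h_diag, 0.0)

# Hypothetical example: unregularized optimum and a diagonal Hessian.
w_star = np.array([3.0, 0.5, 0.2, -2.0])
h_diag = np.array([1.0, 1.0, 1.0, 1.0])
alpha = 1.0

w = l1_quadratic_minimizer(w_star, h_diag, alpha)
print(w)  # [ 2.  0.  0. -1.] -- coordinates with |w*_i| <= alpha/H_ii are zeroed
```

The middle two coordinates fall under outcome 1 and are clipped exactly to zero; the outer two fall under outcome 2 and are shifted toward zero by α/H_{i,i}.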
A similar process happens when w*_i < 0, but with the L¹ penalty making w_i less negative by α/H_{i,i}, or 0.

In comparison to L² regularization, L¹ regularization results in a solution that is more sparse. Sparsity in this context refers to the fact that some parameters have an optimal value of zero. The sparsity of L¹ regularization is a qualitatively different behavior than arises with L² regularization. Eq. 7.13 gave the solution w̃ for L² regularization. If we revisit that equation using the assumption of a diagonal Hessian H that we introduced for our analysis of L¹ regularization, we find that w̃_i = (H_{i,i} / (H_{i,i} + α)) w*_i. If w*_i was nonzero, then w̃_i remains nonzero. This demonstrates that L² regularization does not cause the parameters to become sparse, while L¹ regularization may do so for large enough α.

The sparsity property induced by L¹ regularization has been used extensively as a feature selection mechanism.
Feature selection simplifies a machine learning problem by choosing which subset of the available features should be used. In particular, the well known LASSO (Tibshirani, 1995) (least absolute shrinkage and selection operator) model integrates an L¹ penalty with a linear model and a least squares cost function. The L¹ penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded.
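A LASSO-style fit can be sketched with proximal gradient (ISTA) steps: a gradient step on the least squares cost, followed by the soft thresholding that the L¹ penalty induces. The dataset, step size, and α below are made-up example values in which only the first two features carry signal:

```python
import numpy as np

def lasso_ista(X, y, alpha, lr=0.005, n_steps=3000):
    # Minimize 0.5*||Xw - y||^2 + alpha*||w||_1 by proximal gradient descent:
    # a gradient step on the smooth least-squares term, then soft thresholding.
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w = w - lr * X.T @ (X @ w - y)                          # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lr * alpha, 0)  # soft threshold
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -3.0, 0.0, 0.0, 0.0])  # only two features matter
w = lasso_ista(X, y, alpha=5.0)
print(w)  # the three irrelevant weights are driven exactly to zero
```

Because the final operation of each step is a hard threshold, irrelevant weights end up exactly zero rather than merely small, which is the feature-selection behavior described above.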
In Sec. 5.6.1, we saw that many regularization strategies can be interpreted as MAP Bayesian inference, and that in particular, L² regularization is equivalent to MAP Bayesian inference with a Gaussian prior on the weights. For L¹ regularization, the penalty αΩ(w) = α ∑_i |w_i| used to regularize a cost function is equivalent to the log-prior term that is maximized by MAP Bayesian inference when the prior is an isotropic Laplace distribution (Eq. 3.26) over w:

log p(w) = ∑_i log Laplace(w_i; 0, 1/α) = −α ∑_i |w_i| + n log α − n log 2.   (7.24)

From the point of view of learning via minimization with respect to w, we can ignore the log α − log 2 terms because they do not depend on w.

7.2 Norm Penalties as Constrained Optimization

Consider the cost function regularized by a parameter norm penalty:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).   (7.25)
Recall from Sec. 4.4 that we can minimize a function subject to constraints by constructing a generalized Lagrange function, consisting of the original objective function plus a set of penalties. Each penalty is a product between a coefficient, called a Karush–Kuhn–Tucker (KKT) multiplier, and a function representing whether the constraint is satisfied. If we wanted to constrain Ω(θ) to be less than some constant k, we could construct a generalized Lagrange function

L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).   (7.26)

The solution to the constrained problem is given by

θ* = arg min_θ max_{α, α≥0} L(θ, α).   (7.27)
As described in Sec. 4.4, solving this problem requires modifying both θ and α. Sec. 4.5 provides a worked example of linear regression with an L² constraint. Many different procedures are possible; some may use gradient descent, while others may use analytical solutions for where the gradient is zero. In all procedures, α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k. All positive α encourage Ω(θ) to shrink. The optimal value α* will encourage Ω(θ) to shrink, but not so strongly as to make Ω(θ) become less than k.
To gain some insight into the effect of the constraint, we can fix α* and view the problem as just a function of θ:

θ* = arg min_θ L(θ, α*) = arg min_θ J(θ; X, y) + α*Ω(θ).   (7.28)

This is exactly the same as the regularized training problem of minimizing J̃. We can thus think of a parameter norm penalty as imposing a constraint on the weights. If Ω is the L² norm, then the weights are constrained to lie in an L² ball. If Ω is the L¹ norm, then the weights are constrained to lie in a region of limited L¹ norm. Usually we do not know the size of the constraint region that we impose by using weight decay with coefficient α* because the value of α* does not directly tell us the value of k. In principle, one can solve for k, but the relationship between k and α* depends on the form of J. While we do not know the exact size of the constraint region, we can control it roughly by increasing or decreasing α in order to grow or shrink the constraint region.
Larger α will result in a smaller constraint region. Smaller α will result in a larger constraint region.

Sometimes we may wish to use explicit constraints rather than penalties. As described in Sec. 4.4, we can modify algorithms such as stochastic gradient descent to take a step downhill on J(θ) and then project θ back to the nearest point that satisfies Ω(θ) < k. This can be useful if we have an idea of what value of k is appropriate and do not want to spend time searching for the value of α that corresponds to this k.
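The step-then-project scheme just described can be sketched for an L² norm ball constraint. The quadratic objective and the constraint radius k here are made-up example values:

```python
import numpy as np

def project_l2_ball(theta, k):
    # Project theta back to the nearest point with ||theta||_2 <= k.
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

def constrained_gd(grad_J, theta0, k, lr=0.1, n_steps=100):
    # A gradient step downhill on J, followed by reprojection onto the
    # constraint region, repeated until convergence.
    theta = theta0
    for _ in range(n_steps):
        theta = project_l2_ball(theta - lr * grad_J(theta), k)
    return theta

# Example: J(theta) = 0.5*||theta - c||^2, whose unconstrained minimum c
# lies outside the unit ball (||c|| = 5).
c = np.array([3.0, 4.0])
theta = constrained_gd(lambda t: t - c, np.zeros(2), k=1.0)
print(theta, np.linalg.norm(theta))  # converges to c/||c||, norm 1
```

The iterates are pulled toward c but are repeatedly projected back, so the solution lands on the boundary of the constraint region in the direction of the unconstrained optimum.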
When using a high learning rate, it is possible to enter a positive feedback loop in which large weights induce large gradients, which in turn induce large updates that consistently increase the size of the weights; θ then rapidly moves away from the origin until numerical overflow occurs. Explicit constraints with reprojection allow us to terminate this feedback loop after the weights have reached a certain magnitude. Hinton et al. (2012c) recommend using constraints combined with a high learning rate to allow rapid exploration of parameter space while maintaining some stability.

In particular, Hinton et al. (2012c) recommend constraining the norm of each column of the weight matrix of a neural net layer, rather than constraining the Frobenius norm of the entire weight matrix, a strategy introduced by Srebro and Shraibman (2005). Constraining the norm of each column separately prevents any one hidden unit from having very large weights. If we converted this constraint into a penalty in a Lagrange function, it would be similar to L² weight decay but with a separate KKT multiplier for the weights of each hidden unit. Each of these KKT multipliers would be dynamically updated separately to make each hidden unit obey the constraint. In practice, column norm limitation is always implemented as an explicit constraint with reprojection.
7.3
Regularization and Under-Constrained Problems
In cases, regularization and is necessary for machine learning Problems problems to be 7.3someRegularization Under-Constrained prop properly erly defined. Man Many y linear mo models dels in machine learning, including linear reIn some cases, regularization is necessary machine learning to be X >X gression and PCA, dep depend end on in inv verting theformatrix . This problems is not possible properly y linear dels incan machine learning, including linear reX >X is Man whenev whenever er defined. singular. Thismo matrix be singular whenev whenever er the data truly X gression and PCA, depend on inverting thethere matrix This is not possible X) has no variance in some direction, or when areX few fewer er .examples (ro (rows ws of ofX X X whenev er is singular. This matrix can b e singular whenev er the data truly than input features (columns of X). In this case, many forms of regularization has no ond variance someXdirection, > X + αI or when there are fewer examples (rows of X ) corresp correspond to in inv vin erting instead. This regularized matrix is guaran guaranteed teed than features (columns of X). In this case, many forms of regularization to be input in invertible. vertible. correspond to inverting X X + αI instead. This regularized matrix is guaranteed These linear problems ha hav ve closed form solutions when the relev relevant ant matrix to be invertible. is in invertible. vertible. It is also possible for a problem with no closed form solution to be These linear problems have isclosed form solutionsapplied when to thea relev ant matrix underdetermined. An example logistic regression problem where is inclasses vertible. is also possible for Ifa aproblem solution to be the areItlinearly separable. weigh weightt with vectornowclosed is ableform to achiev achieve e perfect underdetermined. example ishiev logistic regression appliedand to ahigher problem where w will classification, then 2An also ac achiev hieve e perfect classification likelihoo likelihood. 
d. w the classes are linearly separable. If a weigh t vector is able to achiev e p erfect An iterativ iterativee optimization pro procedure cedure lik likee sto stochastic chastic gradient descent will con contin tin tinually ually w classification, then 2 will also ac hiev e p erfect classification and higher likelihoo d. increase the magnitude of w and, in theory theory,, will never halt. In practice, a numerical An iterativtation e optimization pro ceduret lik e sto chastic descenttly will contin ually implemen implementation of gradien gradient t descen descent will even eventually tuallygradient reach sufficien sufficiently large weigh weights ts w increase the magnitude of and, in theory , will never halt. In practice, a n umerical to cause numerical ov overflow, erflow, at which point its behavior will dep depend end on ho how w the implemen tation of gradien t descen t will even tually reach sufficien tly large weigh ts programmer has decided to handle values that are not real num umb bers. to cause numerical overflow, at which point its behavior will depend on how the Most forms regularization are vable guarantee convergence vergence iterativee programmer hasofdecided to handle aluestothat are notthe realcon num bers. of iterativ 239to guarantee the convergence of iterative Most forms of regularization are able
Most forms of regularization are able to guarantee the convergence of iterative methods applied to underdetermined problems. For example, weight decay will cause gradient descent to quit increasing the magnitude of the weights when the slope of the likelihood is equal to the weight decay coefficient.

The idea of using regularization to solve underdetermined problems extends beyond machine learning. The same idea is useful for several basic linear algebra problems.

As we saw in Sec. 2.9, we can solve underdetermined linear equations using the Moore–Penrose pseudoinverse. Recall that one definition of the pseudoinverse X⁺ of a matrix X is

X⁺ = lim_{α→0⁺} (X⊤X + αI)⁻¹ X⊤.   (7.29)

We can now recognize Eq. 7.29 as performing linear regression with weight decay. Specifically, Eq. 7.29 is the limit of Eq. 7.17 as the regularization coefficient shrinks to zero. We can thus interpret the pseudoinverse as stabilizing underdetermined problems using regularization.
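Eq. 7.29 can be checked numerically. For a small made-up wide matrix (more columns than rows, so X⊤X is singular), the regularized inverse approaches NumPy's pseudoinverse as α shrinks:

```python
import numpy as np

def ridge_pseudoinverse(X, alpha):
    # (X^T X + alpha*I)^{-1} X^T, the expression inside the limit of Eq. 7.29.
    n = X.shape[1]
    return np.linalg.inv(X.T @ X + alpha * np.eye(n)) @ X.T

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # 2x3: underdetermined, X^T X not invertible

for alpha in (1.0, 1e-4, 1e-8):
    err = np.abs(ridge_pseudoinverse(X, alpha) - np.linalg.pinv(X)).max()
    print(alpha, err)  # the error shrinks as alpha approaches zero
```

For any α > 0 the matrix being inverted is well conditioned enough to compute, which is exactly the stabilizing effect of regularization described above.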
7.4
Dataset Augmentation
The best way to make a machine learning model generalize better is to train it on more data. Of course, in practice, the amount of data we have is limited. One way to get around this problem is to create fake data and add it to the training set. For some machine learning tasks, it is reasonably straightforward to create new fake data.

This approach is easiest for classification. A classifier needs to take a complicated, high dimensional input x and summarize it with a single category identity y. This means that the main task facing a classifier is to be invariant to a wide variety of transformations. We can generate new (x, y) pairs easily just by transforming the x inputs in our training set.

This approach is not as readily applicable to many other tasks. For example, it is difficult to generate new fake data for a density estimation task unless we have already solved the density estimation problem.

Dataset augmentation has been a particularly effective technique for a specific classification problem: object recognition. Images are high dimensional and include an enormous variety of factors of variation, many of which can be easily simulated. Operations like translating the training images a few pixels in each direction can often greatly improve generalization, even if the model has already been designed to be partially translation invariant by using the convolution and pooling techniques
described in Chapter 9. Many other operations such as rotating the image or scaling the image have also proven quite effective.

One must be careful not to apply transformations that would change the correct class. For example, optical character recognition tasks require recognizing the difference between 'b' and 'd' and the difference between '6' and '9', so horizontal flips and 180° rotations are not appropriate ways of augmenting datasets for these tasks.

There are also transformations that we would like our classifiers to be invariant to, but which are not easy to perform. For example, out-of-plane rotation cannot be implemented as a simple geometric operation on the input pixels.

Dataset augmentation is effective for speech recognition tasks as well (Jaitly and Hinton, 2013).

Injecting noise in the input to a neural network (Sietsma and Dow, 1991) can also be seen as a form of data augmentation. For many classification and even some regression tasks, the task should still be possible to solve even if small random noise is added to the input. Neural networks prove not to be very robust to noise, however (Tang and Eliasmith, 2010). One way to improve the robustness of neural networks is simply to train them with random noise applied to their inputs. Input noise injection is part of some unsupervised learning algorithms such as the denoising autoencoder (Vincent et al., 2008). Noise injection also works when the noise is applied to the hidden units, which can be seen as doing dataset augmentation at multiple levels of abstraction. Poole et al. (2014) recently showed that this approach can be highly effective provided that the magnitude of the noise is carefully tuned. Dropout, a powerful regularization strategy that will be described in Sec. 7.12, can be seen as a process of constructing new inputs by multiplying by noise.

When comparing machine learning benchmark results, it is important to take the effect of dataset augmentation into account. Often, hand-designed dataset augmentation schemes can dramatically reduce the generalization error of a machine learning technique. To compare the performance of one machine learning algorithm to another, it is necessary to perform controlled experiments. When comparing machine learning algorithm A and machine learning algorithm B, it is necessary to make sure that both algorithms were evaluated using the same hand-designed dataset augmentation schemes. Suppose that algorithm A performs poorly with no dataset augmentation and algorithm B performs well when combined with numerous synthetic transformations of the input. In such a case it is likely the synthetic transformations caused the improved performance, rather than the use of machine learning algorithm B. Sometimes deciding whether an experiment
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING
has been properly controlled requires subjective judgment. For example, machine learning algorithms that inject noise into the input are performing a form of dataset augmentation. Usually, operations that are generally applicable (such as adding Gaussian noise to the input) are considered part of the machine learning algorithm, while operations that are specific to one application domain (such as randomly cropping an image) are considered to be separate pre-processing steps.
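As a concrete illustration of input noise injection, the sketch below (NumPy; the model, noise scale `sigma`, learning rate, and epoch count are illustrative choices, not from the text) trains a linear least-squares model while drawing fresh Gaussian noise for every presentation of the inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_with_input_noise(X, y, sigma=0.1, lr=0.05, epochs=300):
    """Train a linear least-squares model with Gaussian input noise.

    A new noise sample is drawn for every presentation of the data,
    so the model never sees exactly the same inputs twice.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        X_noisy = X + sigma * rng.normal(size=X.shape)  # fresh noise each pass
        grad = 2 * X_noisy.T @ (X_noisy @ w - y) / len(y)
        w -= lr * grad
    return w

# Synthetic regression problem with known true weights.
X = rng.normal(size=(64, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = train_with_input_noise(X, y)
```

With a small `sigma`, the learned weights stay close to the noiseless least-squares solution while gaining some robustness to input perturbations; the noise acts much like a small ridge penalty on the weights.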
7.5 Noise Robustness
Sec. 7.4 has motivated the use of noise applied to the inputs as a dataset augmentation strategy. For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights (Bishop, 1995a,b). In the general case, it is important to remember that noise injection can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units. Noise applied to the hidden units is such an important topic as to merit its own separate discussion; the dropout algorithm described in Sec. 7.12 is the main development of that approach.

Another way that noise has been used in the service of regularizing models is by adding it to the weights. This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011).
This can be interpreted as a stochastic implementation of a Bayesian inference over the weights. The Bayesian treatment of learning would consider the model weights to be uncertain and representable via a probability distribution that reflects this uncertainty. Adding noise to the weights is a practical, stochastic way to reflect this uncertainty (Graves, 2011).

This can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization. Adding noise to the weights has been shown to be an effective regularization strategy in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011). In the following, we will present an analysis of the effect of weight noise on a standard feedforward neural network (as introduced in Chapter 6).
We study the regression setting, where we wish to train a function $\hat{y}(x)$ that maps a set of features $x$ to a scalar using the least-squares cost function between the model predictions $\hat{y}(x)$ and the true values $y$:

$$J = \mathbb{E}_{p(x,y)}\left[(\hat{y}(x) - y)^2\right]. \tag{7.30}$$

The training set consists of $m$ labeled examples $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$.
We now assume that with each input presentation we also include a random perturbation $\epsilon_W \sim \mathcal{N}(\epsilon; 0, \eta I)$ of the network weights. Let us imagine that we have a standard $l$-layer MLP. We denote the perturbed model as $\hat{y}_{\epsilon_W}(x)$. Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network. The objective function thus becomes:

$$\tilde{J}_W = \mathbb{E}_{p(x,y,\epsilon_W)}\left[(\hat{y}_{\epsilon_W}(x) - y)^2\right] \tag{7.31}$$
$$= \mathbb{E}_{p(x,y,\epsilon_W)}\left[\hat{y}_{\epsilon_W}^2(x) - 2y\,\hat{y}_{\epsilon_W}(x) + y^2\right]. \tag{7.32}$$

For small $\eta$, the minimization of $J$ with added weight noise (with covariance $\eta I$) is equivalent to minimization of $J$ with an additional regularization term: $\eta\,\mathbb{E}_{p(x,y)}\left[\|\nabla_W \hat{y}(x)\|^2\right]$. This form of regularization encourages the parameters to go to regions of parameter space where small perturbations of the weights have a relatively small influence on the output.
In other words, it pushes the model into regions where the model is relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions (Hochreiter and Schmidhuber, 1995). In the simplified case of linear regression (where, for instance, $\hat{y}(x) = w^\top x + b$), this regularization term collapses into $\eta\,\mathbb{E}_{p(x)}\left[\|x\|^2\right]$, which is not a function of parameters and therefore does not contribute to the gradient of $\tilde{J}_W$ with respect to the model parameters.
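For a linear model this equivalence can be checked numerically. The sketch below (NumPy; all values are synthetic) compares a Monte Carlo estimate of the squared error under weight noise $\epsilon \sim \mathcal{N}(0, \eta I)$ against the noiseless error plus the $\eta\,\|\nabla_w \hat{y}(x)\|^2$ penalty, which for $\hat{y} = w^\top x$ is exactly $\eta\,\|x\|^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# For a linear model yhat = w @ x, weight noise eps ~ N(0, eta*I) adds
# exactly eta * ||x||^2 to the expected squared error, matching the
# eta * E[||grad_w yhat||^2] penalty (here grad_w yhat = x).
eta = 0.01
x = rng.normal(size=5)
y = 1.5
w = rng.normal(size=5)

noiseless = (w @ x - y) ** 2

# Monte Carlo estimate of the expected error under weight perturbations.
eps = rng.normal(scale=np.sqrt(eta), size=(200_000, 5))
noisy = np.mean(((w + eps) @ x - y) ** 2)

penalty = eta * np.sum(x ** 2)  # eta * ||grad_w yhat||^2
```

Up to Monte Carlo error, `noisy` agrees with `noiseless + penalty`, illustrating the decomposition in the text for a single fixed $(x, y)$ pair.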
7.5.1 Injecting Noise at the Output Targets
Most datasets have some amount of mistakes in the $y$ labels. It can be harmful to maximize $\log p(y \mid x)$ when $y$ is a mistake. One way to prevent this is to explicitly model the noise on the labels. For example, we can assume that for some small constant $\epsilon$, the training set label $y$ is correct with probability $1 - \epsilon$, and otherwise any of the other possible labels might be correct. This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples. For example, label smoothing regularizes a model based on a softmax with $k$ output values by replacing the hard 0 and 1 classification targets with targets of $\epsilon / (k-1)$ and $1 - \epsilon$, respectively. The standard cross-entropy loss may then be used with these soft targets. Maximum likelihood learning with a softmax classifier and hard targets may actually never converge: the softmax can
never predict a probability of exactly 0 or exactly 1, so it will continue to learn larger and larger weights, making more extreme predictions forever. It is possible to prevent this scenario using other regularization strategies like weight decay. Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification. This strategy has been used since
the 1980s and continues to be featured prominently in modern neural networks (Szegedy et al., 2015).
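A minimal sketch of constructing these soft targets (NumPy; the function name and the value of $\epsilon$ are illustrative):

```python
import numpy as np

def smooth_labels(labels, k, eps=0.1):
    """Replace hard one-hot targets with 1 - eps for the correct class
    and eps / (k - 1) for each of the other k - 1 classes."""
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

# Two examples with labels 0 and 2, over k = 3 classes.
soft = smooth_labels(np.array([0, 2]), k=3, eps=0.1)
```

Each row of `soft` still sums to 1, so the smoothed targets remain a valid distribution and can be fed directly into the standard cross-entropy loss.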
7.6 Semi-Supervised Learning
In the paradigm of semi-supervised learning, both unlabeled examples from $P(x)$ and labeled examples from $P(x, y)$ are used to estimate $P(y \mid x)$ or predict $y$ from $x$.

In the context of deep learning, semi-supervised learning usually refers to learning a representation $h = f(x)$. The goal is to learn a representation so that examples from the same class have similar representations. Unsupervised learning can provide useful cues for how to group examples in representation space. Examples that cluster tightly in the input space should be mapped to similar representations. A linear classifier in the new space may achieve better generalization in many cases (Belkin and Niyogi, 2002; Chapelle et al., 2003). A long-standing variant of this approach is the application of principal components analysis as a pre-processing step before applying a classifier (on the projected data).
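The PCA pre-processing variant can be sketched in a few lines (NumPy; the data are synthetic and the number of components is arbitrary): fit the principal components on the pooled labeled and unlabeled inputs, then project the labeled examples before training a classifier on them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the data: many unlabeled inputs, few labeled.
X_unlabeled = rng.normal(size=(500, 10))
X_labeled = rng.normal(size=(40, 10))

# Fit PCA on all available inputs, labeled and unlabeled alike.
X_all = np.vstack([X_unlabeled, X_labeled])
mean = X_all.mean(axis=0)
# Principal directions come from the SVD of the centered pooled data.
_, _, Vt = np.linalg.svd(X_all - mean, full_matrices=False)
components = Vt[:2]  # keep the top 2 directions

# Project the labeled examples; these become features for a classifier.
Z_labeled = (X_labeled - mean) @ components.T
```

The unlabeled data influence only the learned projection (an estimate of the structure of $P(x)$), while the downstream classifier is trained on the projected labeled examples alone.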
Instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either $P(x)$ or $P(x, y)$ shares parameters with a discriminative model of $P(y \mid x)$. One can then trade off the supervised criterion $-\log P(y \mid x)$ with the unsupervised or generative one (such as $-\log P(x)$ or $-\log P(x, y)$). The generative criterion then expresses a particular form of prior belief about the solution to the supervised learning problem (Lasserre et al., 2006), namely that the structure of $P(x)$ is connected to the structure of $P(y \mid x)$ in a way that is captured by the shared parametrization. By controlling how much of the generative criterion is included in the total criterion, one can find a better trade-off than with a purely generative
or a purely discriminative training criterion (Lasserre et al., 2006; Larochelle and Bengio, 2008).

In the context of scarcity of labeled data (and abundance of unlabeled data), deep architectures have shown promise as well. Salakhutdinov and Hinton (2008) describe a method for learning the kernel function of a kernel machine used for regression, in which the usage of unlabeled examples for modeling $P(x)$ improves $P(y \mid x)$ quite significantly.

See Chapelle et al. (2006) for more information about semi-supervised learning.
7.7 Multi-Task Learning
Multi-task learning (Caruana, 1993) is a way to improve generalization by pooling the examples (which can be seen as soft constraints imposed on the parameters) arising out of several tasks. In the same way that additional training examples put more pressure on the parameters of the model towards values that generalize well, when part of a model is shared across tasks, that part of the model is more constrained towards good values (assuming the sharing is justified), often yielding better generalization.
Figure 7.2: Multi-task learning can be cast in several ways in deep learning frameworks and this figure illustrates the common situation where the tasks share a common input but involve different target random variables. The lower layers of a deep network (whether it is supervised and feedforward or includes a generative component with downward arrows) can be shared across such tasks, while task-specific parameters (associated respectively with the weights into and from $h^{(1)}$ and $h^{(2)}$) can be learned on top of those yielding a shared representation $h^{(\text{shared})}$. The underlying assumption is that there exists a common pool of factors that explain the variations in the input $x$, while each task is associated with a subset of these factors. In this example, it is additionally assumed that top-level hidden units $h^{(1)}$ and $h^{(2)}$ are specialized to each task (respectively predicting $y^{(1)}$ and $y^{(2)}$) while some intermediate-level representation $h^{(\text{shared})}$ is shared across all tasks. In the unsupervised learning context, it makes sense for some of the top-level factors to be associated with none of the output tasks ($h^{(3)}$): these are the factors that explain some of the input variations but are not relevant for predicting $y^{(1)}$ or $y^{(2)}$.

Fig. 7.2 illustrates a very common form of multi-task learning, in which different supervised tasks (predicting $y^{(i)}$ given $x$) share the same input $x$, as well as some intermediate-level representation $h^{(\text{shared})}$ capturing a common pool of factors. The
model can generally be divided into two kinds of parts and associated parameters:

1. Task-specific parameters (which only benefit from the examples of their task to achieve good generalization). These are the upper layers of the neural network in Fig. 7.2.

2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the lower layers of the neural network in Fig. 7.2.

Improved generalization and generalization error bounds (Baxter, 1995) can be achieved because of the shared parameters, for which statistical strength can be greatly improved (in proportion with the increased number of examples for the shared parameters, compared to the scenario of single-task models). Of course this will happen
only if some assumptions about the statistical relationship between the different tasks are valid, meaning that there is something shared across some of the tasks.

From the point of view of deep learning, the underlying prior belief is the following: among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
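The architecture of Fig. 7.2 can be sketched as a forward pass with one shared weight matrix (generic parameters) and a small head per task (task-specific parameters). All layer sizes and task names below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic parameters: a shared lower layer that pools statistical
# strength across tasks. Task-specific parameters: one head per task.
W_shared = rng.normal(size=(8, 4))
heads = {"task1": rng.normal(size=8), "task2": rng.normal(size=8)}

def predict(x, task):
    h_shared = np.maximum(0.0, W_shared @ x)  # shared representation
    return heads[task] @ h_shared             # task-specific output

x = rng.normal(size=4)
y1 = predict(x, "task1")
y2 = predict(x, "task2")
```

During training, gradients from every task's examples would update `W_shared`, while each head is updated only by its own task's examples, which is where the pooling benefit described above comes from.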
7.8 Early Stopping
When training large models with sufficient representational capacity to overfit the task, we often observe that training error decreases steadily over time, but validation set error begins to rise again. See Fig. 7.3 for an example of this behavior. This behavior occurs very reliably.

This means we can obtain a model with better validation set error (and thus, hopefully better test set error) by returning to the parameter setting at the point in time with the lowest validation set error. Instead of running our optimization algorithm until we reach a (local) minimum of validation error, we run it until the error on the validation set has not improved for some amount of time. Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than
the latest parameters. This procedure is specified more formally in Algorithm 7.1.

This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.
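The procedure just described (store a copy of the parameters whenever validation error improves, stop after it fails to improve for a set number of evaluations) can be sketched as follows. This is a simplified stand-in for Algorithm 7.1, demonstrated on a toy one-parameter problem where training keeps pushing the parameter past the validation optimum:

```python
import numpy as np

def early_stopping(train_step, validation_error, params, patience=5):
    """Keep a copy of the best parameters seen so far; stop after
    `patience` consecutive evaluations without improvement."""
    best_params = params.copy()
    best_error = validation_error(params)
    strikes = 0
    while strikes < patience:
        params = train_step(params)
        err = validation_error(params)
        if err < best_error:
            best_error, best_params, strikes = err, params.copy(), 0
        else:
            strikes += 1
    return best_params, best_error

# Toy problem: validation error is minimized at w = 2, but each
# training step moves w upward by 0.1, eventually overshooting.
w_best, e_best = early_stopping(
    train_step=lambda w: w + 0.1,
    validation_error=lambda w: (w[0] - 2.0) ** 2,
    params=np.array([0.0]),
    patience=5,
)
```

The loop returns the parameters from the point with the lowest validation error (here, $w \approx 2$), not the parameters at termination, which have already drifted past the optimum.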
Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (indicated as number of training iterations over the dataset, or epochs). In this example, we train a maxout network on MNIST. Observe that the training objective decreases consistently over time, but the validation set average loss eventually begins to increase again, forming an asymmetric U-shaped curve.
One way to think of early stopping is as a very efficient hyperparameter selection algorithm. In this view, the number of training steps is just another hyperparameter. We can see in Fig. 7.3 that this hyperparameter has a U-shaped validation set performance curve. Most hyperparameters that control model capacity have such a U-shaped validation set performance curve, as illustrated in Fig. 5.3. In the case of early stopping, we are controlling the effective capacity of the model by determining how many steps it can take to fit the training set. Most hyperparameters must be chosen using an expensive guess and check process, where we set a hyperparameter at the start of training, then run training for several steps to see its effect. The "training time" hyperparameter is unique in that by definition a single run of training tries out many values of the hyperparameter.
The only significant cost to choosing this hyperparameter automatically via early stopping is running the validation set evaluation periodically during training. Ideally, this is done in parallel to the training process on a separate machine, separate CPU, or separate GPU from the main training process. If such resources are not available, then the cost of these periodic evaluations may be reduced by using a validation set that is small compared to the training set or by evaluating the validation set error less frequently and obtaining a lower resolution estimate of the optimal training time.

An additional cost to early stopping is the need to maintain a copy of the best parameters. This cost is generally negligible, because it is acceptable to store these parameters in a slower and larger form of memory (for example, training in
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING
GPU memory, but storing the optimal parameters in host memory or on a disk drive). Since the best parameters are written to infrequently and never read during training, these occasional slow writes have little effect on the total training time.

Early stopping is a very unobtrusive form of regularization, in that it requires almost no change in the underlying training procedure, the objective function, or the set of allowable parameter values. This means that it is easy to use early stopping without damaging the learning dynamics. This is in contrast to weight decay, where one must be careful not to use too much weight decay and trap the network in a bad local minimum corresponding to a solution with pathologically small weights.

Early stopping may be used either alone or in conjunction with other regularization strategies. Even when using regularization strategies that modify the objective function to encourage better generalization, it is rare for the best generalization to occur at a local minimum of the training objective.

Early stopping requires a validation set, which means some training data is not fed to the model. To best exploit this extra data, one can perform extra training after the initial training with early stopping has completed. In the second, extra training step, all of the training data is included. There are two basic strategies one can use for this second training procedure.

One strategy (Algorithm 7.2) is to initialize the model again and retrain on all of the data. In this second training pass, we train for the same number of steps as the early stopping procedure determined was optimal in the first pass.
There are some subtleties associated with this procedure. For example, there is not a good way of knowing whether to retrain for the same number of parameter updates or the same number of passes through the dataset. On the second round of training, each pass through the dataset will require more parameter updates because the training set is bigger.

Another strategy for using all of the data is to keep the parameters obtained from the first round of training and then continue training but now using all of the data. At this stage, we no longer have a guide for when to stop in terms of a number of steps. Instead, we can monitor the average loss function on the validation set, and continue training until it falls below the value of the training set objective at which the early stopping procedure halted. This strategy avoids the high cost of retraining the model from scratch, but is not as well-behaved.
For example, there is not any guarantee that the objective on the validation set will ever reach the target value, so this strategy is not even guaranteed to terminate. This procedure is presented more formally in Algorithm 7.3.

Early stopping is also useful because it reduces the computational cost of the
Figure 7.4: An illustration of the effect of early stopping. (Left) The solid contour lines indicate the contours of the negative log-likelihood. The dashed line indicates the trajectory taken by SGD beginning from the origin. Rather than stopping at the point w∗ that minimizes the cost, early stopping results in the trajectory stopping at an earlier point w̃. (Right) An illustration of the effect of L2 regularization for comparison. The dashed circles indicate the contours of the L2 penalty, which causes the minimum of the total cost to lie nearer the origin than the minimum of the unregularized cost.
training procedure. Besides the obvious reduction in cost due to limiting the number of training iterations, it also has the benefit of providing regularization without requiring the addition of penalty terms to the cost function or the computation of the gradients of such additional terms.

How early stopping acts as a regularizer: So far we have stated that early stopping is a regularization strategy, but we have supported this claim only by showing learning curves where the validation set error has a U-shaped curve. What is the actual mechanism by which early stopping regularizes the model? Bishop (1995a) and Sjöberg and Ljung (1995) argued that early stopping has the effect of restricting the optimization procedure to a relatively small volume of parameter space in the neighborhood of the initial parameter value θ_o. More specifically, imagine taking τ optimization steps (corresponding to τ training iterations) with learning rate ε.
We can view the product ετ as a measure of effective capacity. Assuming the gradient is bounded, restricting both the number of iterations and the learning rate limits the volume of parameter space reachable from θ_o. In this sense, ετ behaves as if it were the reciprocal of the coefficient used for weight decay.

Indeed, in the case of a simple linear model with a quadratic error function and simple gradient descent, we can show that early stopping is equivalent to L2
regularization.

In order to compare with classical L2 regularization, we examine a simple setting where the only parameters are linear weights (θ = w). We can model the cost function J with a quadratic approximation in the neighborhood of the empirically optimal value of the weights w∗:

    Ĵ(θ) = J(w∗) + ½ (w − w∗)^⊤ H (w − w∗),        (7.33)

where H is the Hessian matrix of J with respect to w evaluated at w∗. Given the assumption that w∗ is a minimum of J(w), we know that H is positive semidefinite. Under a local Taylor series approximation, the gradient is given by:

    ∇_w Ĵ(w) = H (w − w∗).        (7.34)

We are going to study the trajectory followed by the parameter vector during training. For simplicity, let us set the initial parameter vector to the origin,³ that is w^(0) = 0. Let us suppose that we update the parameters via gradient descent:
    w^(τ) = w^(τ−1) − ε ∇_w Ĵ(w^(τ−1))        (7.35)
          = w^(τ−1) − ε H (w^(τ−1) − w∗)        (7.36)
    w^(τ) − w∗ = (I − εH)(w^(τ−1) − w∗)        (7.37)

Let us now rewrite this expression in the space of the eigenvectors of H, exploiting the eigendecomposition of H: H = QΛQ^⊤, where Λ is a diagonal matrix and Q is an orthonormal basis of eigenvectors:

    w^(τ) − w∗ = (I − εQΛQ^⊤)(w^(τ−1) − w∗)        (7.38)
    Q^⊤(w^(τ) − w∗) = (I − εΛ) Q^⊤(w^(τ−1) − w∗)        (7.39)

Assuming that w^(0) = 0 and that ε is chosen to be small enough to guarantee |1 − ελ_i| < 1, the parameter trajectory during training after τ parameter updates is as follows:

    Q^⊤ w^(τ) = [I − (I − εΛ)^τ] Q^⊤ w∗.        (7.40)

Now, the expression for Q^⊤ w̃ in Eq. 7.13 for L2 regularization can be rearranged as:

    Q^⊤ w̃ = (Λ + αI)^(−1) Λ Q^⊤ w∗        (7.41)
³ For neural networks, to obtain symmetry breaking between hidden units, we cannot initialize all the parameters to 0, as discussed in Sec. 6.2. However, the argument holds for any other initial value w^(0).
    Q^⊤ w̃ = [I − (Λ + αI)^(−1) α] Q^⊤ w∗        (7.42)

Comparing Eq. 7.40 and Eq. 7.42, we see that if the hyperparameters ε, α, and τ are chosen such that

    (I − εΛ)^τ = (Λ + αI)^(−1) α,        (7.43)

then L2 regularization and early stopping can be seen to be equivalent (at least under the quadratic approximation of the objective function). Going even further, by taking logarithms and using the series expansion for log(1 + x), we can conclude that if all λ_i are small (that is, ελ_i ≪ 1 and λ_i/α ≪ 1) then

    τ ≈ 1/(εα),        (7.44)
    α ≈ 1/(τε).        (7.45)

That is, under these assumptions, the number of training iterations τ plays a role inversely proportional to the L2 regularization parameter, and the inverse of τε plays the role of the weight decay coefficient.

Parameter values corresponding to directions of significant curvature (of the objective function) are regularized less than directions of less curvature. Of course, in the context of early stopping, this really means that parameters that correspond
to directions of significant curvature tend to learn early relative to parameters corresponding to directions of less curvature.

The derivations in this section have shown that a trajectory of length τ ends at a point that corresponds to a minimum of the L2-regularized objective. Early stopping is of course more than the mere restriction of the trajectory length; instead, early stopping typically involves monitoring the validation set error in order to stop the trajectory at a particularly good point in space. Early stopping therefore has the advantage over weight decay that early stopping automatically determines the correct amount of regularization while weight decay requires many training experiments with different values of its hyperparameter.
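The equivalence derived above can be checked numerically. The sketch below, with hypothetical values chosen so that ελ_i and λ_i/α are small, runs τ steps of gradient descent on a quadratic objective and compares the result to the closed-form L2-regularized minimizer w̃ = (H + αI)^(−1) H w∗ with α = 1/(τε):

```python
import numpy as np

# Hypothetical quadratic cost J(w) = 0.5 (w - w*)^T H (w - w*)
H = np.diag([0.05, 0.02])          # small eigenvalues, so eps * lambda_i << 1
w_star = np.array([1.0, -2.0])

eps, tau = 0.01, 100               # learning rate and number of training steps
w = np.zeros(2)                    # start at the origin, as in the derivation
for _ in range(tau):
    w = w - eps * H @ (w - w_star)     # gradient descent step, Eq. 7.35

# L2-regularized minimizer with alpha = 1/(tau * eps), per Eq. 7.44
alpha = 1.0 / (tau * eps)
w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

print(w)        # early-stopped solution
print(w_tilde)  # weight-decayed solution; close to the early-stopped one
```

Because H is diagonal here, the trajectory can also be checked against Eq. 7.40 coordinate-wise: after τ steps, each coordinate equals [1 − (1 − ελ_i)^τ] w∗_i.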
7.9 Parameter Tying and Parameter Sharing
Thus far, in this chapter, when we have discussed adding constraints or penalties to the parameters, we have always done so with respect to a fixed region or point. For example, L2 regularization (or weight decay) penalizes model parameters for deviating from the fixed value of zero. However, sometimes we may need other ways to express our prior knowledge about suitable values of the model parameters.
Sometimes we might not know precisely what values the parameters should take but we know, from knowledge of the domain and model architecture, that there should be some dependencies between the model parameters.

A common type of dependency that we often want to express is that certain parameters should be close to one another. Consider the following scenario: we have two models performing the same classification task (with the same set of classes) but with somewhat different input distributions. Formally, we have model A with parameters w^(A) and model B with parameters w^(B). The two models map the input to two different, but related outputs: ŷ^(A) = f(w^(A), x) and ŷ^(B) = g(w^(B), x).

Let us imagine that the tasks are similar enough (perhaps with similar input and output distributions) that we believe the model parameters should be close to each other: ∀i, w_i^(A) should be close to w_i^(B). We can leverage this information through regularization. Specifically, we can use a parameter norm penalty of the form: Ω(w^(A), w^(B)) = ‖w^(A) − w^(B)‖²₂. Here we used an L2 penalty, but other choices are also possible.

This kind of approach was proposed by Lasserre et al. (2006), who regularized the parameters of one model, trained as a classifier in a supervised paradigm, to be close to the parameters of another model, trained in an unsupervised paradigm (to capture the distribution of the observed input data). The architectures were constructed such that many of the parameters in the classifier model could be paired to corresponding parameters in the unsupervised model.

While a parameter norm penalty is one way to regularize parameters to be close to one another, the more popular way is to use constraints: to force sets of parameters to be equal. This method of regularization is often referred to as parameter sharing, where we interpret the various models or model components as sharing a unique set of parameters. A significant advantage of parameter sharing over regularizing the parameters to be close (via a norm penalty) is that only a subset of the parameters (the unique set) need to be stored in memory. In certain models, such as the convolutional neural network, this can lead to a significant reduction in the memory footprint of the model.

Convolutional Neural Networks   By far the most popular and extensive use of parameter sharing occurs in convolutional neural networks (CNNs) applied to computer vision.

Natural images have many statistical properties that are invariant to translation.
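As a concrete sketch of the parameter-tying penalty from the two-model scenario above, the function below (names are illustrative, not from the book) computes Ω(w^(A), w^(B)) = α‖w^(A) − w^(B)‖²₂ along with the gradient terms each model would add to its loss gradient:

```python
import numpy as np

def tying_penalty(w_a, w_b, alpha):
    """Parameter-tying penalty alpha * ||w_a - w_b||_2^2 and its gradients.
    A sketch of the norm penalty described above; names are illustrative."""
    diff = w_a - w_b
    omega = alpha * np.sum(diff ** 2)   # the penalty added to the total loss
    grad_a = 2 * alpha * diff           # contribution to model A's gradient
    grad_b = -2 * alpha * diff          # contribution to model B's gradient
    return omega, grad_a, grad_b

w_a = np.array([1.0, 0.5, -2.0])
w_b = np.array([0.8, 0.5, -1.0])
omega, ga, gb = tying_penalty(w_a, w_b, alpha=0.1)
print(omega)   # approximately 0.104
```

The two gradient contributions are equal and opposite, pulling the two parameter vectors toward each other during training.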
For example, a photo of a cat remains a photo of a cat if it is translated one pixel
to the right. CNNs take this property into account by sharing parameters across multiple image locations. The same feature (a hidden unit with the same weights) is computed over different locations in the input. This means that we can find a cat with the same cat detector whether the cat appears at column i or column i + 1 in the image.

Parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters and to significantly increase network sizes without requiring a corresponding increase in training data. It remains one of the best examples of how to effectively incorporate domain knowledge into the network architecture. CNNs will be discussed in more detail in Chapter 9.
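The translation property that motivates parameter sharing can be seen with a tiny 1-D sketch (a hypothetical two-weight "edge detector", not an example from the book): the same kernel is applied at every position, so the detector's response simply shifts along with the input.

```python
import numpy as np

def correlate1d(x, k):
    """Apply one shared kernel k at every valid position of x.
    This is the parameter sharing of a convolutional layer, sketched in 1-D."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

edge_kernel = np.array([-1.0, 1.0])                   # a single shared 2-weight detector
signal_a = np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0])
signal_b = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # same edge, shifted by one position

resp_a = correlate1d(signal_a, edge_kernel)
resp_b = correlate1d(signal_b, edge_kernel)
print(resp_a)  # the detector fires (+1) at the rising edge, wherever it occurs
print(resp_b)
```

The same two weights detect the edge at either position; a fully connected layer would need a separate set of weights for every location.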
7.10 Sparse Representations
Weight decay acts by placing a penalty directly on the model parameters. Another strategy is to place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse. This indirectly imposes a complicated penalty on the model parameters.

We have already discussed (in Sec. 7.1.2) how L1 penalization induces a sparse parametrization, meaning that many of the parameters become zero (or close to zero). Representational sparsity, on the other hand, describes a representation where many of the elements of the representation are zero (or close to zero). A simplified view of this distinction can be illustrated in the context of linear
regression:

    ⎡ 18 ⎤   ⎡ 4  0  0 −2  0  0 ⎤ ⎡  2 ⎤
    ⎢  5 ⎥   ⎢ 0  0 −1  0  3  0 ⎥ ⎢  3 ⎥
    ⎢ 15 ⎥ = ⎢ 0  5  0  0  0  0 ⎥ ⎢ −2 ⎥        (7.46)
    ⎢ −9 ⎥   ⎢ 1  0  0 −1  0 −4 ⎥ ⎢ −5 ⎥
    ⎣ −3 ⎦   ⎣ 1  0  0  0 −5  0 ⎦ ⎢  1 ⎥
                                   ⎣  4 ⎦
     y ∈ ℝ^m      A ∈ ℝ^(m×n)       x ∈ ℝ^n

    ⎡ −14 ⎤   ⎡  3 −1  2 −5  4  1 ⎤ ⎡  0 ⎤
    ⎢   1 ⎥   ⎢  4  2 −3 −1  1  3 ⎥ ⎢  2 ⎥
    ⎢  19 ⎥ = ⎢ −1  5  4  2 −3 −1 ⎥ ⎢  0 ⎥        (7.47)
    ⎢   2 ⎥   ⎢  3  1  2 −3  0 −3 ⎥ ⎢  0 ⎥
    ⎣  23 ⎦   ⎣ −5  4 −2  2 −5 −1 ⎦ ⎢ −3 ⎥
                                     ⎣  0 ⎦
     y ∈ ℝ^m      B ∈ ℝ^(m×n)        h ∈ ℝ^n
In the first expression, we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.

Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization.

Norm penalty regularization of representations is performed by adding to the loss function J a norm penalty on the representation, denoted Ω(h). As before, we denote the regularized loss function by J̃:

    J̃(θ; X, y) = J(θ; X, y) + αΩ(h)        (7.48)

where α ∈ [0, ∞) weights the relative contribution of the norm penalty term, with larger values of α corresponding to more regularization.

Just as an L1 penalty on the parameters induces parameter sparsity, an L1
penalty on the elements of the representation induces representational sparsity: Ω(h) = ‖h‖₁ = Σ_i |h_i|. Of course, the L1 penalty is only one choice of penalty that can result in a sparse representation. Others include the penalty derived from a Student-t prior on the representation (Olshausen and Field, 1996; Bergstra, 2011) and KL divergence penalties (Larochelle and Bengio, 2008) that are especially useful for representations with elements constrained to lie on the unit interval. Lee et al. (2008) and Goodfellow et al. (2009) both provide examples of strategies based on regularizing the average activation across several examples, (1/m) Σ_i h^(i), to be near some target value, such as a vector with .01 for each entry.
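A minimal sketch of Eq. 7.48 with the L1 representational penalty, for a hypothetical two-layer linear model h = Wx, ŷ = Vh (all names illustrative):

```python
import numpy as np

def regularized_loss(x, y, W, V, alpha=0.1):
    """J~ = J + alpha * Omega(h), with Omega(h) = ||h||_1, for a toy
    linear model h = W x, y_hat = V h. A sketch, not a library API."""
    h = W @ x                              # the representation being sparsified
    y_hat = V @ h
    data_loss = np.sum((y_hat - y) ** 2)   # ordinary squared-error term J
    penalty = alpha * np.sum(np.abs(h))    # L1 penalty on the representation
    return data_loss + penalty, h
```

Note that the penalty is applied to the activations h rather than to the weights W and V, which is what distinguishes representational sparsity from a sparse parametrization.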
For example, ortho orthogonal gonal matching pursuit (Pati et al., Other approac hes obtain represen tational sparsit with a es hard t on h ythat 1993 1993)) enco encodes des an input x with the representation solv solves theconstrain constrained the activationproblem values. For example, orthogonal matching pursuit (Pati et al., optimization P 1993) encodes an input x witharg the representation kx − W hk2 , h that solves the constrained min (7.49) optimization problem h,khk0
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING
7.11 Bagging and Other Ensemble Methods
Bagging for bootstr otstrap ap aggr aggre egatingEnsemble ) is a techniqueMetho for reducing 7.11 (short Bagging and Other ds generalization error by combining several mo models dels (Breiman, 1994). The idea is to train several Bagging (short for b o otstr ap aggr gating ) isofathe technique generalization differen differentt mo models dels separately separately,, theneha have ve all models for votereducing on the output for test error b y combining several mo dels ( Breiman , 1994 ). The idea is to train examples. This is an example of a general strategy in machine learning calledseveral mo model del differen t mo separately , then ha ve all of the are models voteason the output fords test aver averaging aging aging. . Tdels echniques employing this strategy known ensemble metho methods ds. . examples. This is an example of a general strategy in machine learning called model that mo model del averaging is are thatknown different mo models dels will usually averThe agingreason . Techniques employing this works strategy as ensemble metho ds. not mak makee all the same errors on the test set. The reason that model averaging works is that different models will usually Consider for example a set of k regression models. Supp Suppose ose that eac each h model not make all the same errors on the test set. mak makes es an error i on each example, with the errors drawn from a zero-mean k regression Consider for example a set of models. oseariances that eacEh[model [2i ] = vSupp multiv ultivariate ariate normal distribution with variances E and cov covariances i j ] = mak es an error on each example, with the errors drawn from a zero-mean Then the error made by the av average erage prediction ensemble ble mo models c. P E of all the ensem E dels is m 1 ultivariate normal distribution with variances [ ] = v and covariances [ ] = expected ected squared error of the ensem ensemble ble predictor is i i . 
The expected squared error of the ensemble predictor is

    E[ ( (1/k) Σᵢ εᵢ )² ] = (1/k²) E[ Σᵢ ( εᵢ² + Σ_{j≠i} εᵢεⱼ ) ]    (7.50)
    = (1/k) v + ((k−1)/k) c.    (7.51)

In the case where the errors are perfectly correlated and c = v, the mean squared error reduces to v, so the model averaging does not help at all. In the case where the errors are perfectly uncorrelated and c = 0, the expected squared error of the ensemble is only (1/k) v. This means that the expected squared error of the ensemble decreases linearly with the ensemble size. In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than its members.

Different ensemble methods construct the ensemble of models in different ways. For example, each member of the ensemble could be formed by training a completely
different kind of model using a different algorithm or objective function. Bagging is a method that allows the same kind of model, training algorithm and objective function to be reused several times.

Specifically, bagging involves constructing k different datasets. Each dataset has the same number of examples as the original dataset, but each dataset is constructed by sampling with replacement from the original dataset. This means that, with high probability, each dataset is missing some of the examples from the original dataset and also contains several duplicate examples (on average around 2/3 of the examples from the original dataset are found in the resulting training
[Figure 7.5 panels: the original dataset; a first resampled dataset with its ensemble member's '8' detector; a second resampled dataset with its ensemble member's '8' detector.]
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an '8' detector on the dataset depicted above, containing an '8', a '6' and a '9'. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the '9' and repeats the '8'. On this dataset, the detector learns that a loop on top of the digit corresponds to an '8'. On the second dataset, we repeat the '9' and omit the '6'. In this case, the detector learns that a loop on the bottom of the digit corresponds to an '8'. Each of these individual classification rules is brittle, but if we average their output then the detector is robust, achieving maximal confidence only when both loops of the '8' are present.
set, if it has the same size as the original). Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models. See Fig. 7.5 for an example.

Neural networks reach a wide enough variety of solution points that they can often benefit from model averaging even if all of the models are trained on the same dataset. Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors.

Model averaging is an extremely powerful and reliable method for reducing generalization error. Its use is usually discouraged when benchmarking algorithms for scientific papers, because any machine learning algorithm can benefit substantially from model averaging at the price of increased computation and memory. For this reason, benchmark comparisons are usually made using a single model.

Machine learning contests are usually won by methods using model averaging over dozens of models. A recent prominent example is the Netflix Grand Prize (Koren, 2009).
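The dataset construction described above can be sketched as sampling index arrays with replacement (sizes below are illustrative); the fraction of distinct examples retained concentrates around 1 − 1/e, which matches the roughly 2/3 figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 10_000, 5                                  # dataset size, number of bootstrap datasets
bootstrap_indices = [rng.integers(0, m, size=m) for _ in range(k)]

# Fraction of distinct original examples present in each resampled dataset.
unique_fractions = [len(np.unique(idx)) / m for idx in bootstrap_indices]
```

Each fraction lands close to 1 − 1/e ≈ 0.632 for a dataset of this size.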
Not all techniques for constructing ensembles are designed to make the ensemble more regularized than the individual models. For example, a technique called boosting (Freund and Schapire, 1996b,a) constructs an ensemble with higher capacity than the individual models. Boosting has been applied to build ensembles of neural networks (Schwenk and Bengio, 1998) by incrementally adding neural networks to the ensemble. Boosting has also been applied interpreting an individual neural network as an ensemble (Bengio et al., 2006a), incrementally adding hidden units to the neural network.
7.12 Dropout
Dropout (Srivastava et al., 2014) provides a computationally inexpensive but powerful method of regularizing a broad family of models. To a first approximation, dropout can be thought of as a method of making bagging practical for ensembles of very many large neural networks. Bagging involves training multiple models, and evaluating multiple models on each test example. This seems impractical when each model is a large neural network, since training and evaluating such networks is costly in terms of runtime and memory. It is common to use ensembles of five to ten neural networks (Szegedy et al. (2014a) used six to win the ILSVRC), but more than this rapidly becomes unwieldy. Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.
Specifically, dropout trains the ensemble consisting of all sub-networks that can be formed by removing non-output units from an underlying base network, as illustrated in Fig. 7.6. In most modern neural networks, based on a series of affine transformations and nonlinearities, we can effectively remove a unit from a network by multiplying its output value by zero. This procedure requires some slight modification for models such as radial basis function networks, which take the difference between the unit's state and some reference value. Here, we present the dropout algorithm in terms of multiplication by zero for simplicity, but it can be trivially modified to work with other operations that remove a unit from the network.

Recall that to learn with bagging, we define k different models, construct k
different datasets by sampling from the training set with replacement, and then train model i on dataset i. Dropout aims to approximate this process, but with an exponentially large number of neural networks. Specifically, to train with dropout, we use a minibatch-based learning algorithm that makes small steps, such as stochastic gradient descent. Each time we load an example into a minibatch, we
[Figure 7.6 diagram: the base network, with input units x1, x2, hidden units h1, h2, and output y, alongside the ensemble of sixteen sub-networks formed by removing subsets of units.]
Figure 7.6: Dropout trains an ensemble consisting of all sub-networks that can be constructed by removing non-output units from an underlying base network. Here, we begin with a base network with two visible units and two hidden units. There are sixteen possible subsets of these four units. We show all sixteen subnetworks that may be formed by dropping out different subsets of units from the original network. In this small example, a large proportion of the resulting networks have no input units or no path connecting the input to the output. This problem becomes insignificant for networks with wider layers, where the probability of dropping all possible paths from inputs to outputs becomes smaller.
[Figure 7.7 diagram: (top) the base feedforward network with units x1, x2, h1, h2, y; (bottom) the same network with each input and hidden unit multiplied by its corresponding mask entry µ.]
Figure 7.7: An example of forward propagation through a feedforward network using dropout. (Top) In this example, we use a feedforward network with two input units, one hidden layer with two hidden units, and one output unit. (Bottom) To perform forward propagation with dropout, we randomly sample a vector µ with one entry for each input or hidden unit in the network. The entries of µ are binary and are sampled independently from each other. The probability of each entry being 1 is a hyperparameter, usually 0.5 for the hidden layers and 0.8 for the input. Each unit in the network is multiplied by the corresponding mask, and then forward propagation continues through the rest of the network as usual. This is equivalent to randomly selecting one of the sub-networks from Fig. 7.6 and running forward propagation through it.
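The forward pass of Fig. 7.7 can be sketched as follows. The parameter values and the ReLU nonlinearity are illustrative assumptions; the layer sizes and inclusion probabilities follow the caption's typical values.

```python
import numpy as np

def dropout_forward(x, W1, b1, W2, b2, rng, p_input=0.8, p_hidden=0.5):
    """One stochastic forward pass: sample a binary mask entry for every input
    and hidden unit, and remove a unit by multiplying its value by zero."""
    mu_x = (rng.random(x.shape) < p_input).astype(float)   # input mask, P(entry = 1) = 0.8
    h = np.maximum(0.0, W1 @ (x * mu_x) + b1)              # hidden layer (ReLU assumed)
    mu_h = (rng.random(h.shape) < p_hidden).astype(float)  # hidden mask, P(entry = 1) = 0.5
    return W2 @ (h * mu_h) + b2

rng = np.random.default_rng(0)
x = rng.normal(size=2)                          # two input units, as in the figure
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # two hidden units
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)   # one output unit
y_out = dropout_forward(x, W1, b1, W2, b2, rng) # a fresh random sub-network per call
```

Each call samples a new mask, so repeated calls propagate through different sub-networks of Fig. 7.6.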
randomly sample a different binary mask to apply to all of the input and hidden units in the network. The mask for each unit is sampled independently from all of the others. The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter fixed before training begins. It is not a function of the current value of the model parameters or the input example. Typically, an input unit is included with probability 0.8 and a hidden unit is included with probability 0.5. We then run forward propagation, back-propagation, and the learning update as usual. Fig. 7.7 illustrates how to run forward propagation with dropout.

More formally, suppose that a mask vector µ specifies which units to include, and J(θ, µ) defines the cost of the model defined by parameters θ and mask µ. Then dropout training consists in minimizing E_µ J(θ, µ). The expectation contains
exponentially many terms, but we can obtain an unbiased estimate of its gradient by sampling values of µ.

Dropout training is not quite the same as bagging training. In the case of bagging, the models are all independent. In the case of dropout, the models share parameters, with each model inheriting a different subset of parameters from the parent neural network. This parameter sharing makes it possible to represent an exponential number of models with a tractable amount of memory. In the case of bagging, each model is trained to convergence on its respective training set. In the case of dropout, typically most models are not explicitly trained at all; usually, the model is large enough that it would be infeasible to sample all possible sub-networks within the lifetime of the universe. Instead, a tiny fraction of the possible
a tin tiny ytrained fractionatof all—usually the possible, the model is large enough thatforit awsingle ould bstep, e infeasible sample allsharing possible subsub-net sub-netw works are eac each h trained and thetoparameter causes net w orks within the lifetime of the univ erse. Instead, a tin y fraction of the p ossible the remaining sub-net sub-networks works to arrive at go goo od settings of the parameters. These sub-net orksdifferences. are each trained a single step, follows and thethe parameter sharing causes are the w only Bey Beyond ondforthese, dropout bagging algorithm. For the remaining sub-net works to arrive at go o d settings of the parameters. These example, the training set encountered by eac each h sub-netw sub-network ork is indeed a subset of are the only differences. Bey ond these, dropout follows the bagging algorithm. For the original training set sampled with replacemen replacement. t. example, the training set encountered by each sub-network is indeed a subset of To mak makee a prediction, a bagged ensemble must accum accumulate ulate votes from all of the original training set sampled with replacement. its members. We refer to this pro process cess as infer inferenc enc encee in this con context. text. So far, our T o mak e a prediction, a bagged ensemble must accum ulate votesbefrom all of description of bagging and dropout has not required that the model explicitly its members. W ew,refer this pro cess infer encrole e inisthis context.a probability So far, our probabilistic. No Now, we to assume that theasmo model’s del’s to output description of bagging and dropout has not required that the model b e explicitly distribution. In the case of bagging, each mo model del i pro produces duces a probability distribution w, we assume the mo is toarithmetic output a mean probability (i)( y | x pprobabilistic. ). TheNo prediction of thethat ensemble is del’s giv given enrole by the of all i distribution. 
In the case of bagging, each mo del pro duces a probability distribution of these distributions, p ( y x). The prediction of the ensemble is given by the arithmetic mean of all k 1 X (i) of these distributions, p (y | x). (7.52) | k i=1 1 p (y x). (7.52) In the case of drop dropout, out, each sub-mo sub-model k del defined by mask vector µ defines a prob| 260 In the case of dropout, each sub-model defined by mask vector µ defines a probX
CHAPTER 7. REGULARIZATION FOR DEEP LEARNING
abilit ability y distribution p (y | x, µ ). The arithmetic mean over all masks is giv given en by ability distribution p (y x, µ )X . The arithmetic mean over all masks is given p(µ)p(y | x, µ) (7.53) by | µ p(µ)p(y x, µ) (7.53) where p( µ ) is the probabilit probability y distribution that was used to sample µ at training | time. where p( µ ) is the probability distribution that was used to sample µ at training Because this sum includes an exp exponential onential num numb ber of terms, it is intractable time. X to ev evaluate aluate except in cases where the structure of the mo model del permits some form Because this sum includes an exp onential num b er of terms, it is intractable of simplification. So far, deep neural nets are not kno known wn to permit any tractable to ev aluate except in cases where the structure of the mo del p ermits some simplification. Instead, we can appro approximate ximate the inference with sampling,form by simplification. Sothe far,output deep neural not kno wn to permit any are tractable aofveraging together from nets manyare masks. Even 10-20 masks often simplification. Instead, can approximate the inference with sampling, by sufficien sufficientt to obtain goo good d pwe erformance. averaging together the output from many masks. Even 10-20 masks are often Ho How wev ever, er, there is an even better approac approach, h, that allows us to obtain a go goo od sufficient to obtain good performance. appro approximation ximation to the predictions of the entire ensemble, at the cost of only one Ho er, there is Tan even approac h, that us to obtain a go od forw forward ardwev propagation. o do so, bwetter e change to using the allows geometric mean rather than appro ximation mean to theofpredictions of the entire the cost of only one the arithmetic the ensem ensemble ble mem memb bers’ensemble, predicted at distributions. Wardeforw ard propagation. T o do so, w e c hange to using the geometric mean rather than Farley et al. 
(2014) present argumen arguments ts and empirical evidence that the geometric the arithmetic mean of the ensem ble members’ predicted Wardemean performs comparably to the arithmetic mean in this distributions. context. Farley et al. (2014) present arguments and empirical evidence that the geometric The geometric mean of multiple probability distributions is not guaranteed to be mean performs comparably to the arithmetic mean in this context. a probability distribution. To guarantee that the result is a probabilit probability y distribution, The geometric mean of m ultiple probability distributions is not guaranteed bye we impose the requirement that none of the sub-models assigns probability 0 totoan any aeven probability To guarantee that the result is aunnormalized probability distribution, ev en ent, t, and wedistribution. renormalize the resulting distribution. The probabilit probability y w e impose thedefined requirement that of the sub-models 0 to any distribution directly by none the geometric mean is assigns giv given en bprobability y event, and we renormalize the resulting distribution. The unnormalized probability sY distribution defined directly by the geometric mean is given by p˜ensemble(y | x) = 2d p(y | x, µ) (7.54) µ
p˜ (y x) = p(y x, µ) (7.54) where d is the num numb ber of units that may b e dropp dropped. ed. Here we use a uniform | | distribution over µ to simplify the presentation, but non-uniform distributions are where ber of units that be dropped. the Hereensemble: we use a uniform d is theTnum also possible. o mak make e predictions we may must sre-normalize Y but non-uniform distributions are distribution over µ to simplify the presentation, p˜ensemble (y | x) the ensemble: also possible. To make predictions we must re-normalize p ensemble(y | x) = P . (7.55) ˜ensemble (y0 | x) y0 p p˜ (y x) (y x) = p . (7.55) p˜ (|y x) A key insight (Hinton et al., |2012c) in involv volv volved ed in dropout is that we can appro approxixi| p p ( y | x mate ensemble by ev evaluating aluating ) in one model: the mo model del with all units, but keywinsight (Hinton , 2012c ) involvedbyinthe dropout is that e can appro xii multiplied withAthe eigh eights ts going outetofal. unit probabilit probability y ofwincluding unit p(dification y x) in one by ev aluating thethe morigh del twith all units, imate . Thep motiv motivation ation for this mo modification is tomodel: capture right expected valuebut of P i with the w eigh ts going out of unit m ultiplied b y the probabilit y of including unit | the output from that unit. We call this approac approach h the weight sc scaling aling infer inferenc enc encee rule rule.. i. The motivation for this modification is to capture the right expected value of the output from that unit. We call this261 approach the weight scaling inference rule.
There is not yet any theoretical argument for the accuracy of this approximate inference rule in deep nonlinear net networks, works, but empirically it performs very well. There is not yet any theoretical argument for the accuracy of this approximate Because we usually use an inclusion probability of 12 , the weight scaling rule inference rule in deep nonlinear networks, but empirically it performs very well. usually amounts to dividing the weights by 2 at the end of training, and then using e usually use an winclusion probability of ,result the wis eight scaling rule the Because mo model del aswusual. Another ay to achiev achieve e the same to multiply the usuallyofamounts tobdividing thetraining. weights bEither y 2 at the training, andethen states the units y 2 during wayend , theofgoal is to mak make sureusing that the expected model astotal usual. Another wayattotest achiev same the result is to the the input to a unit timee isthe roughly same as multiply the exp expected ected statesinput of thetounits y 2 during Either wayhalf , thethe goal is toatmak e sure total that bunit at traintraining. time, ev even en though units train timethat are the expected total input to a unit at test time is roughly the same as the exp ected missing on average. total input to that unit at train time, even though half the units at train time are For many classes of mo models dels that do not ha hav ve nonlinear hidden units, the weigh weightt missing on average. scaling inference rule is exact. For a simple example, consider a softmax regression For many of vmo dels that do not habvye the nonlinear units, the weight classifier withclasses n input ariables represented vectorhidden v: scaling inference rule is exact. For a simple example, consider a softmax regression > classifier with n inputPv(ariables thevvector + b v. : y = y | vrepresented ) = softmaxbyW (7.56) y
We can index into the family of sub-models by element-wise multiplication of the input with a binary vector d:

\[
P(\mathrm{y} = y \mid \mathbf{v}; \mathbf{d}) = \operatorname{softmax}\big(\mathbf{W}^\top (\mathbf{d} \odot \mathbf{v}) + \mathbf{b}\big)_y.
\tag{7.57}
\]

The ensemble predictor is defined by re-normalizing the geometric mean over all ensemble members' predictions:

\[
P_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v}) = \frac{\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v})}{\sum_{y'} \tilde{P}_{\text{ensemble}}(\mathrm{y} = y' \mid \mathbf{v})}
\tag{7.58}
\]

where

\[
\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v}) = \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} P(\mathrm{y} = y \mid \mathbf{v}; \mathbf{d})}.
\tag{7.59}
\]

To see that the weight scaling rule is exact, we can simplify \(\tilde{P}_{\text{ensemble}}\):

\[
\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v}) = \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} P(\mathrm{y} = y \mid \mathbf{v}; \mathbf{d})}
\tag{7.60}
\]
\[
= \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \operatorname{softmax}\big(\mathbf{W}^\top(\mathbf{d} \odot \mathbf{v}) + \mathbf{b}\big)_y}
\tag{7.61}
\]
\[
= \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \frac{\exp\big(\mathbf{W}_{y,:}^\top(\mathbf{d} \odot \mathbf{v}) + b_y\big)}{\sum_{y'} \exp\big(\mathbf{W}_{y',:}^\top(\mathbf{d} \odot \mathbf{v}) + b_{y'}\big)}}
\tag{7.62}
\]
\[
= \frac{\sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \exp\big(\mathbf{W}_{y,:}^\top(\mathbf{d} \odot \mathbf{v}) + b_y\big)}}{\sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \sum_{y'} \exp\big(\mathbf{W}_{y',:}^\top(\mathbf{d} \odot \mathbf{v}) + b_{y'}\big)}}
\tag{7.63}
\]

Because \(\tilde{P}\) will be normalized, we can safely ignore multiplication by factors that are constant with respect to y:

\[
\tilde{P}_{\text{ensemble}}(\mathrm{y} = y \mid \mathbf{v}) \propto \sqrt[2^n]{\prod_{\mathbf{d} \in \{0,1\}^n} \exp\big(\mathbf{W}_{y,:}^\top(\mathbf{d} \odot \mathbf{v}) + b_y\big)}
\tag{7.64}
\]
\[
= \exp\left(\frac{1}{2^n} \sum_{\mathbf{d} \in \{0,1\}^n} \mathbf{W}_{y,:}^\top(\mathbf{d} \odot \mathbf{v}) + b_y\right)
\tag{7.65}
\]
\[
= \exp\left(\frac{1}{2} \mathbf{W}_{y,:}^\top \mathbf{v} + b_y\right)
\tag{7.66}
\]

Substituting this back into Eq. 7.58 we obtain a softmax classifier with weights (1/2)W.

The weight scaling rule is also exact in other settings, including regression networks with conditionally normal outputs, and deep networks that have hidden layers without nonlinearities. However, the weight scaling rule is only an approximation for deep models that have nonlinearities. Though the approximation has not been theoretically characterized, it often works well empirically. Goodfellow et al. (2013a) found experimentally that the weight scaling approximation can work better (in terms of classification accuracy) than Monte Carlo approximations to the ensemble predictor. This held true even when the Monte Carlo approximation was allowed to sample up to 1,000 sub-networks. Gal and Ghahramani (2015) found that some models obtain better classification accuracy using twenty samples and the Monte Carlo approximation. It appears that the optimal choice of inference approximation is problem-dependent.

Srivastava et al. (2014) showed that dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints and sparse activity regularization. Dropout may also be combined with other forms of regularization to yield a further improvement.

One advantage of dropout is that it is very computationally cheap. Using dropout during training requires only O(n) computation per example per update, to generate n random binary numbers and multiply them by the state. Depending on the implementation, it may also require O(n) memory to store these binary numbers until the back-propagation stage. Running inference in the trained model
has the same cost per-example as if dropout were not used, though we must pay the cost of dividing the weights by 2 once before beginning to run inference on examples.

Another significant advantage of dropout is that it does not significantly limit the type of model or training procedure that can be used. It works well with nearly any model that uses a distributed representation and can be trained with stochastic gradient descent. This includes feedforward neural networks, probabilistic models such as restricted Boltzmann machines (Srivastava et al., 2014), and recurrent neural networks (Bayer and Osendorfer, 2014; Pascanu et al., 2014a). Many other regularization strategies of comparable power impose more severe restrictions on the architecture of the model.

Though the cost per-step of applying dropout to a specific model is negligible,
the cost of using dropout in a complete system can be significant. Because dropout is a regularization technique, it reduces the effective capacity of a model. To offset this effect, we must increase the size of the model. Typically the optimal validation set error is much lower when using dropout, but this comes at the cost of a much larger model and many more iterations of the training algorithm. For very large datasets, regularization confers little reduction in generalization error. In these cases, the computational cost of using dropout and larger models may outweigh the benefit of regularization.

When extremely few labeled training examples are available, dropout is less effective. Bayesian neural networks (Neal, 1996) outperform dropout on the Alternative Splicing Dataset (Xiong et al., 2011) where fewer than 5,000 examples are available (Srivastava et al., 2014). When additional unlabeled data is available,
unsupervised feature learning can gain an advantage over dropout.

Wager et al. (2013) showed that, when applied to linear regression, dropout is equivalent to L² weight decay, with a different weight decay coefficient for each input feature. The magnitude of each feature's weight decay coefficient is determined by its variance. Similar results hold for other linear models. For deep models, dropout is not equivalent to weight decay.

The stochasticity used while training with dropout is not necessary for the approach's success. It is just a means of approximating the sum over all sub-models. Wang and Manning (2013) derived analytical approximations to this marginalization. Their approximation, known as fast dropout, resulted in faster
convergence time due to the reduced stochasticity in the computation of the gradient. This method can also be applied at test time, as a more principled (but also more computationally expensive) approximation to the average over all sub-networks than the weight scaling approximation. Fast dropout has been used
to nearly match the performance of standard dropout on small neural network problems, but has not yet yielded a significant improvement or been applied to a large problem.

Just as stochasticity is not necessary to achieve the regularizing effect of dropout, it is also not sufficient. To demonstrate this, Warde-Farley et al. (2014) designed control experiments using a method called dropout boosting, designed to use exactly the same mask noise as traditional dropout but to lack its regularizing effect. Dropout boosting trains the entire ensemble to jointly maximize the log-likelihood on the training set. In the same sense that traditional dropout is analogous to bagging, this approach is analogous to boosting. As intended, experiments with dropout boosting show almost no regularization effect compared to training the entire network as a single model.
This demonstrates that the interpretation of dropout as bagging has value beyond the interpretation of dropout as robustness to noise. The regularization effect of the bagged ensemble is only achieved when the stochastically sampled ensemble members are trained to perform well independently of each other.

Dropout has inspired other stochastic approaches to training exponentially large ensembles of models that share weights. DropConnect is a special case of dropout where each product between a single scalar weight and a single hidden unit state is considered a unit that can be dropped (Wan et al., 2013). Stochastic pooling is a form of randomized pooling (see Sec. 9.3) for building ensembles of convolutional networks with each convolutional network attending to different spatial locations of each feature map.
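To make the contrast between the two masking schemes concrete, here is a minimal NumPy sketch (not the reference implementation of Wan et al.; the shapes, seed, and names are invented) of how a DropConnect mask differs from a dropout mask for a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

h = rng.standard_normal(8)        # hidden unit activations (toy values)
W = rng.standard_normal((8, 3))   # weights of the next layer
p = 0.5                           # inclusion probability

# Dropout: one binary mask entry per *unit*; dropping a unit zeroes out
# its entire row of outgoing weight products at once.
unit_mask = rng.random(8) < p
out_dropout = (h * unit_mask) @ W

# DropConnect: one binary mask entry per *scalar weight*; each product
# w_ij * h_i is treated as a unit that can be dropped independently.
weight_mask = rng.random((8, 3)) < p
out_dropconnect = h @ (W * weight_mask)

print(out_dropout.shape, out_dropconnect.shape)
```

Dropout thus samples from 2^8 sub-networks here, while DropConnect samples from the much larger family of 2^24 weight-level sub-networks.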
So far, dropout remains the most widely used implicit ensemble method.

One of the key insights of dropout is that training a network with stochastic behavior and making predictions by averaging over multiple stochastic decisions implements a form of bagging with parameter sharing. Earlier, we described dropout as bagging an ensemble of models formed by including or excluding units. However, there is no need for this model averaging strategy to be based on inclusion and exclusion. In principle, any kind of random modification is admissible. In practice, we must choose modification families that neural networks are able to learn to resist. Ideally, we should also use model families that allow a fast approximate inference rule. We can think of any form of modification parametrized by a vector µ as training an ensemble consisting of p(y | x, µ) for all possible
values of µ. There is no requirement that µ have a finite number of values. For example, µ can be real-valued. Srivastava et al. (2014) showed that multiplying the weights by µ ∼ N(1, I) can outperform dropout based on binary masks. Because E[µ] = 1, the standard network automatically implements approximate inference
in the ensemble, without needing any weight scaling.

So far we have described dropout purely as a means of performing efficient, approximate bagging. However, there is another view of dropout that goes further than this. Dropout trains not just a bagged ensemble of models, but an ensemble of models that share hidden units. This means each hidden unit must be able to perform well regardless of which other hidden units are in the model. Hidden units must be prepared to be swapped and interchanged between models. Hinton et al. (2012c) were inspired by an idea from biology: sexual reproduction, which involves swapping genes between two different organisms, creates evolutionary pressure for genes to become not just good, but readily swapped between different organisms.
Such genes and such features are very robust to changes in their environment because they are not able to incorrectly adapt to unusual features of any one organism or model. Dropout thus regularizes each hidden unit to be not merely a good feature but a feature that is good in many contexts. Warde-Farley et al. (2014) compared dropout training to training of large ensembles and concluded that dropout offers additional improvements to generalization error beyond those obtained by ensembles of independent models.

It is important to understand that a large portion of the power of dropout arises from the fact that the masking noise is applied to the hidden units. This can be seen as a form of highly intelligent, adaptive destruction of the information
content of the input rather than destruction of the raw values of the input. For example, if the model learns a hidden unit h_i that detects a face by finding the nose, then dropping h_i corresponds to erasing the information that there is a nose in the image. The model must learn another h_i, either one that redundantly encodes the presence of a nose, or one that detects the face by another feature, such as the mouth. Traditional noise injection techniques that add unstructured noise at the input are not able to randomly erase the information about a nose from an image of a face unless the magnitude of the noise is so great that nearly all of the information in the image is removed. Destroying extracted features rather than original values allows the destruction process to make use of all of the knowledge about the input distribution that the model has acquired so far.
Another important aspect of dropout is that the noise is multiplicative. If the noise were additive with fixed scale, then a rectified linear hidden unit h_i with added noise could simply learn to have h_i become very large in order to make the added noise insignificant by comparison. Multiplicative noise does not allow such a pathological solution to the noise robustness problem.

Another deep learning algorithm, batch normalization, reparametrizes the model in a way that introduces both additive and multiplicative noise on the
[Figure: x ("panda", 57.7% confidence) + .007 × sign(∇_x J(θ, x, y)) ("nematode", 8.2% confidence) = x + .007 sign(∇_x J(θ, x, y)) ("gibbon", 99.3% confidence)]

Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet's classification of the image. Reproduced with permission from Goodfellow et al. (2014b).
hidden units at training time. The primary purpose of batch normalization is to improve optimization, but the noise can have a regularizing effect, and sometimes makes dropout unnecessary. Batch normalization is described further in Sec. 8.7.1.
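The advantage of multiplicative over additive noise discussed above can be demonstrated numerically. In this toy NumPy sketch (the activation values and noise scale are invented), scaling the activations up makes fixed-scale additive noise negligible in relative terms, while the relative effect of multiplicative noise is unchanged, so a unit cannot escape it by growing:

```python
import numpy as np

rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal(3)    # fixed-scale noise sample
h = np.array([1.0, 2.0, 3.0])           # hidden activations (toy values)

rel_err_add = []
rel_err_mul = []
for scale in (1.0, 1000.0):
    h_big = scale * h                       # the unit learns to grow its activation
    additive = h_big + noise                # additive noise becomes negligible...
    multiplicative = h_big * (1.0 + noise)  # ...but multiplicative noise does not
    rel_err_add.append(np.abs(additive / h_big - 1.0).max())
    rel_err_mul.append(np.abs(multiplicative / h_big - 1.0).max())

# The additive relative error collapses with scale; the multiplicative
# relative error is the same at both scales.
print(rel_err_add, rel_err_mul)
```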
7.13
Adversarial Training

In many cases, neural networks have begun to reach human performance when evaluated on an i.i.d. test set. It is natural therefore to wonder whether these models have obtained a true human-level understanding of these tasks. In order to probe the level of understanding a network has of the underlying task, we can search for examples that the model misclassifies. Szegedy et al. (2014b) found that even neural networks that perform at human level accuracy have a nearly 100% error rate on examples that are intentionally constructed by using an optimization procedure to search for an input x′ near a data point x such that the model output is very different at x′. In many cases, x′ can be so similar to x that a human observer cannot tell the difference between the original example and the adversarial example, but the network can make highly different predictions. See Fig. 7.8 for an example.
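The perturbation shown in Fig. 7.8 is built from the sign of the gradient of the cost with respect to the input. A minimal NumPy sketch of this construction for a linear softmax classifier (the model, data, and label here are toy stand-ins, not the actual GoogLeNet experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy linear softmax classifier standing in for a deep model's locally
# linear behavior; W, b, x, and y are all invented for illustration.
W = rng.standard_normal((10, 3))
b = rng.standard_normal(3)
x = rng.standard_normal(10)
y = 0            # assumed true label
epsilon = 0.007  # the perturbation size used in Fig. 7.8

def grad_J_wrt_x(x, y):
    # Gradient of the cross-entropy cost J(theta, x, y) with respect to x:
    # for softmax cross-entropy this is W @ (p - onehot(y)).
    p = softmax(W.T @ x + b)
    p[y] -= 1.0
    return W @ p

# The adversarial input moves each component of x by epsilon in the
# direction that increases the cost: x' = x + epsilon * sign(grad_x J).
x_adv = x + epsilon * np.sign(grad_J_wrt_x(x, y))
```

Each component of x′ differs from x by at most ε, yet for high-dimensional inputs the cost can change substantially, which is the linearity argument developed below.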
Adversarial examples have many implications, for example, in computer security, that are beyond the scope of this chapter. However, they are interesting in the context of regularization because one can reduce the error rate on the original i.i.d. test set via adversarial training: training on adversarially perturbed examples
from the training set (Szegedy et al., 2014b; Go Goo odfello dfellow w et al. al.,, 2014b). Go Goo odfello dfellow w et al. (2014b) sho showed wed that one of the primary causes of these from the training set (Szegedy et al., 2014b; Goodfellow et al., 2014b). adv adversarial ersarial examples is excessiv excessivee linearit linearity y. Neural net networks works are built out of Go o dfello w et al. ( 2014b ) sho wed that one of the primary of these primarily linear building blo blocks. cks. In some exp experiments eriments the ov overall erallcauses function they adv ersarial examples is excessiv e linearit y . Neural net works are built out of implemen implementt pro proves ves to be highly linear as a result. These linear functions are easy primarily linear building blo cks.value In some experiments erall function they to optimize. Unfortunately Unfortunately, , the of a linear functionthe canovchange very rapidly implemen proves to be highly as aeac result. These are easy if it has ntumerous inputs. If wlinear e change each h input by linear , then functions a linear function to optimize. , the aluemuc of cancan change w can change with weights Unfortunately by vas uch ha linear as ||wfunction be avery veryrapidly large || 1, which if it has n umerous inputs. If w e c hange eac h input b y , then a linear function amoun amountt if w is high-dimensional. Adv Adversarial ersarial training discourages this highly w linear with weights can change bybas much as the , which can e cally a very large w netw sensitiv sensitive e lo locally cally beha ehavior vior y encouraging network ork to beblo locally constant amoun t if wborho is high-dimensional. ersarial training discourages this highly || be seen in the neigh neighb orhoo od of the training Adv data. This||can as a way of explicitly sensitiv e locally linear behaviorprior by encouraging the netw orknets. 
to be locally constant in the neighborhood of the training data. This can be seen as a way of explicitly introducing a local constancy prior into supervised neural nets.

Adversarial training helps to illustrate the power of using a large function family in combination with aggressive regularization. Purely linear models, like logistic regression, are not able to resist adversarial examples because they are forced to be linear. Neural networks are able to represent functions that can range from nearly linear to nearly locally constant and thus have the flexibility to capture linear trends in the training data while still learning to resist local perturbation.

Adversarial examples also provide a means of accomplishing semi-supervised learning. At a point x that is not associated with a label in the dataset, the model itself assigns some label ŷ. The model's label ŷ may not be the true label, but if the model is high quality, then ŷ has a high probability of providing the true label. We can seek an adversarial example x′ that causes the classifier to output a label y′ with y′ ≠ ŷ. Adversarial examples generated using not the true label but a label provided by a trained model are called virtual adversarial examples (Miyato et al., 2015). The classifier may then be trained to assign the same label to x and x′. This encourages the classifier to learn a function that is robust to small changes anywhere along the manifold where the unlabeled data lies. The assumption motivating this approach is that different classes usually lie on disconnected manifolds, and a small perturbation should not be able to jump from one class manifold to another class manifold.
7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

Many machine learning algorithms aim to overcome the curse of dimensionality by assuming that the data lies near a low-dimensional manifold, as described in
Sec. 5.11.3.

One of the early attempts to take advantage of the manifold hypothesis is the tangent distance algorithm (Simard et al., 1993, 1998). It is a non-parametric nearest-neighbor algorithm in which the metric used is not the generic Euclidean distance but one that is derived from knowledge of the manifolds near which probability concentrates. It is assumed that we are trying to classify examples and that examples on the same manifold share the same category. Since the classifier should be invariant to the local factors of variation that correspond to movement on the manifold, it would make sense to use as nearest-neighbor distance between points x_1 and x_2 the distance between the manifolds M_1 and M_2 to which they respectively belong.
Although that may be computationally difficult (it would require solving an optimization problem, to find the nearest pair of points on M_1 and M_2), a cheap alternative that makes sense locally is to approximate M_i by its tangent plane at x_i and measure the distance between the two tangents, or between a tangent plane and a point. That can be achieved by solving a low-dimensional linear system (in the dimension of the manifolds). Of course, this algorithm requires one to specify the tangent vectors.

In a related spirit, the tangent prop algorithm (Simard et al., 1992) (Fig. 7.9) trains a neural net classifier with an extra penalty to make each output f(x) of the neural net locally invariant to known factors of variation. These factors of variation correspond to movement along the manifold near which examples of the same class concentrate.
Local invariance is achieved by requiring ∇_x f(x) to be orthogonal to the known manifold tangent vectors v^(i) at x, or equivalently that the directional derivative of f at x in the directions v^(i) be small by adding a regularization penalty Ω:

Ω(f) = Σ_i ((∇_x f(x))^T v^(i))^2.    (7.67)

This regularizer can of course be scaled by an appropriate hyperparameter, and, for most neural networks, we would need to sum over many outputs rather than the lone output f(x) described here for simplicity. As with the tangent distance algorithm, the tangent vectors are derived a priori, usually from the formal knowledge of the effect of transformations such as translation, rotation, and scaling in images. Tangent prop has been used not just for supervised learning (Simard et al., 1992) but also in the context of reinforcement learning (Thrun, 1995).

Tangent propagation is closely related to dataset augmentation. In both cases, the user of the algorithm encodes his or her prior knowledge of the task by specifying a set of transformations that should not alter the output of the
network. The difference is that in the case of dataset augmentation, the network is explicitly trained to correctly classify distinct inputs that were created by applying more than an infinitesimal amount of these transformations. Tangent propagation does not require explicitly visiting a new input point. Instead, it analytically regularizes the model to resist perturbation in the directions corresponding to the specified transformation. While this analytical approach is intellectually elegant, it has two major drawbacks. First, it only regularizes the model to resist infinitesimal perturbation. Explicit dataset augmentation confers resistance to larger perturbations. Second, the infinitesimal approach poses difficulties for models based on rectified linear units.
These models can only shrink their derivatives by turning units off or shrinking their weights. They are not able to shrink their derivatives by saturating at a high value with large weights, as sigmoid or tanh units can. Dataset augmentation works well with rectified linear units because different subsets of rectified units can activate for different transformed versions of each original input.

Tangent propagation is also related to double backprop (Drucker and LeCun, 1992) and adversarial training (Szegedy et al., 2014b; Goodfellow et al., 2014b). Double backprop regularizes the Jacobian to be small, while adversarial training finds inputs near the original inputs and trains the model to produce the same output on these as on the original inputs. Tangent propagation and dataset augmentation using manually specified transformations both require that the model should be invariant to certain specified directions of change in the input.
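The penalty of Eq. 7.67 is straightforward to evaluate when the input gradient of the model is available: it is a sum of squared dot products between ∇_x f(x) and the known tangent vectors. A minimal NumPy sketch (the linear toy model and the tangent vectors here are invented for illustration):

```python
import numpy as np

def tangent_prop_penalty(grad_f_x, tangent_vectors):
    """Omega(f) = sum_i ((grad_x f(x))^T v^(i))^2   (Eq. 7.67).

    grad_f_x        : gradient of the scalar output f at x, shape (n,)
    tangent_vectors : rows are the known tangent vectors v^(i), shape (k, n)
    """
    projections = tangent_vectors @ grad_f_x    # directional derivatives
    return float(np.sum(projections ** 2))

# Toy check: f(x) = w . x has gradient w everywhere.
w = np.array([1.0, 2.0, 0.0])
v1 = np.array([0.0, 0.0, 1.0])  # f is already invariant along v1: no penalty
v2 = np.array([1.0, 0.0, 0.0])  # f changes along v2: penalized
print(tangent_prop_penalty(w, np.stack([v1, v2])))  # prints 1.0
```

In a real training loop this scalar would simply be added, with a hyperparameter weight, to the classification loss.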
Double backprop and adversarial training both require that the model should be invariant to all directions of change in the input so long as the change is small. Just as dataset augmentation is the non-infinitesimal version of tangent propagation, adversarial training is the non-infinitesimal version of double backprop.

The manifold tangent classifier (Rifai et al., 2011c) eliminates the need to know the tangent vectors a priori. As we will see in Chapter 14, autoencoders can estimate the manifold tangent vectors. The manifold tangent classifier makes use of this technique to avoid needing user-specified tangent vectors. As illustrated in Fig. 14.10, these estimated tangent vectors go beyond the classical invariants
that arise out of the geometry of images (such as translation, rotation and scaling) and include factors that must be learned because they are object-specific (such as moving body parts). The algorithm proposed with the manifold tangent classifier is therefore simple: (1) use an autoencoder to learn the manifold structure by unsupervised learning, and (2) use these tangents to regularize a neural net classifier as in tangent prop (Eq. 7.67).

This chapter has described most of the general strategies used to regularize
Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al., 1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the classifier output function f(x). Each curve represents the manifold for a different class, illustrated here as a one-dimensional manifold embedded in a two-dimensional space. On one curve, we have chosen a single point and drawn a vector that is tangent to the class manifold (parallel to and touching the manifold) and a vector that is normal to the class manifold (orthogonal to the manifold). In multiple dimensions there may be many tangent directions and many normal directions. We expect the classification function to change rapidly as it moves in the direction normal to the manifold, and not to change as it moves along the class manifold. Both tangent propagation and the manifold tangent classifier regularize f(x) to not change very much as x moves along the manifold. Tangent
user to manually sp specify ecify functions that the tangent tangent (xecifying ) to notthat classifier regularize change very much as x of moimages ves along the manifold. Tangent directions (such as fsp specifying small translations remain in the same class propagation requires the usertangent to manually sp ecify functions that compute the tangent manifold) while the manifold classifier estimates the manifold tangent directions directions as spncoder ecifying small translations of images remain inders theto same class b y training(such an autoe autoencoder to that fit the training data. The use of auto autoenco enco encoders estimate manifold) while the manifold tangent classifier estimates the manifold tangent directions manifolds will be describ described ed in Chapter 14. by training an autoe ncoder to fit the training data. The use of auto enco ders to estimate manifolds will be describ ed in Chapter 14.
neural networks. Regularization is a central theme of machine learning and as such will be revisited periodically by most of the remaining chapters. Another central theme of machine learning is optimization, described next.
Algorithm 7.1 The early stopping meta-algorithm for determining the best amount of time to train. This meta-algorithm is a general strategy that works well with a variety of training algorithms and ways of quantifying error on the validation set.

Let n be the number of steps between evaluations.
Let p be the "patience," the number of times to observe worsening validation set error before giving up.
Let θ_0 be the initial parameters.
θ ← θ_0
i ← 0
j ← 0
v ← ∞
θ* ← θ
i* ← i
while j < p do
    Update θ by running the training algorithm for n steps.
    i ← i + n
    v′ ← ValidationSetError(θ)
    if v′ < v then
        j ← 0
        θ* ← θ
        i* ← i
        v ← v′
    else
        j ← j + 1
    end if
end while
Best parameters are θ*, best number of training steps is i*.
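Algorithm 7.1 translates almost line for line into code. In this sketch the `train_n_steps` and `validation_error` callables stand in for whatever training procedure and validation metric are in use; they are placeholders, not a fixed API:

```python
import copy

def early_stopping(theta0, train_n_steps, validation_error, n=1, patience=2):
    """Return the parameters and step count with the lowest validation
    error, stopping after `patience` evaluations in a row fail to
    improve on the best error seen so far (Algorithm 7.1)."""
    theta = theta0
    i = j = 0
    best_error = float("inf")
    best_theta, best_i = copy.deepcopy(theta), i
    while j < patience:
        theta = train_n_steps(theta, n)  # run the training algorithm for n steps
        i += n
        v = validation_error(theta)
        if v < best_error:               # improvement: record it, reset patience
            j = 0
            best_error = v
            best_theta, best_i = copy.deepcopy(theta), i
        else:                            # no improvement: spend one unit of patience
            j += 1
    return best_theta, best_i

# Toy check: each "training step" decrements theta by 1; validation
# error is minimized at theta = 3.
theta_star, i_star = early_stopping(
    theta0=10.0,
    train_n_steps=lambda th, n: th - n,
    validation_error=lambda th: (th - 3.0) ** 2,
    n=1, patience=2)
print(theta_star, i_star)  # prints 3.0 7
```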
Algorithm 7.2 A meta-algorithm for using early stopping to determine how long to train, then retraining on all the data.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (Algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This returns i*, the optimal number of steps.
Set θ to random values again.
Train on X^(train) and y^(train) for i* steps.
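A sketch of Algorithm 7.2 in Python; `split`, `run_early_stopping` (e.g. the procedure of Algorithm 7.1), `init_params`, and `train_n_steps` are hypothetical callables supplied by the surrounding training code, not a fixed API:

```python
def retrain_with_early_stopping(X_train, y_train, split, run_early_stopping,
                                init_params, train_n_steps):
    """Algorithm 7.2: use a subtrain/valid split only to find the
    optimal number of steps i*, then retrain from a fresh random
    initialization on ALL the training data for i* steps."""
    (X_sub, y_sub), (X_val, y_val) = split(X_train, y_train)
    _, i_star = run_early_stopping(init_params(), X_sub, y_sub, X_val, y_val)
    theta = init_params()              # set theta to random values again
    return train_n_steps(theta, X_train, y_train, i_star)
```

The key design point is that the second training run sees the validation examples too, at the cost of trusting that i* transfers from the smaller subtrain set to the full set.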
Algorithm 7.3 Meta-algorithm using early stopping to determine at what objective value we start to overfit, then continue training until that value is reached.

Let X^(train) and y^(train) be the training set.
Split X^(train) and y^(train) into (X^(subtrain), X^(valid)) and (y^(subtrain), y^(valid)) respectively.
Run early stopping (Algorithm 7.1) starting from random θ using X^(subtrain) and y^(subtrain) for training data and X^(valid) and y^(valid) for validation data. This updates θ.
ε ← J(θ, X^(subtrain), y^(subtrain))
while J(θ, X^(valid), y^(valid)) > ε do
    Train on X^(train) and y^(train) for n steps.
end while
Chapter 8
Optimization for Training Deep Models
Deep learning algorithms involve optimization in many contexts. For example, performing inference in models such as PCA involves solving an optimization problem. We often use analytical optimization to write proofs or design algorithms. Of all of the many optimization problems involved in deep learning, the most difficult is neural network training. It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem. Because this problem is so important and so expensive, a specialized set of optimization techniques have been developed for solving it. This chapter presents these optimization techniques for neural network training.

If you are unfamiliar with the basic principles of gradient-based optimization, we suggest reviewing Chapter 4.
That chapter includes a brief overview of numerical optimization in general.

This chapter focuses on one particular case of optimization: finding the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

We begin with a description of how optimization used as a training algorithm for a machine learning task differs from pure optimization. Next, we present several of the concrete challenges that make optimization of neural networks difficult. We then define several practical algorithms, including both optimization algorithms themselves and strategies for initializing the parameters. More advanced algorithms adapt their learning rates during training or leverage information contained in
the second derivatives of the cost function. Finally, we conclude with a review of several optimization strategies that are formed by combining simple optimization algorithms into higher-level procedures.
8.1 How Learning Differs from Pure Optimization

Optimization algorithms used for training of deep models differ from traditional optimization algorithms in several ways. Machine learning usually acts indirectly. In most machine learning scenarios, we care about some performance measure P, that is defined with respect to the test set and may also be intractable. We therefore optimize P only indirectly. We reduce a different cost function J(θ) in the hope that doing so will improve P. This is in contrast to pure optimization, where minimizing J is a goal in and of itself. Optimization algorithms for training deep models also typically include some specialization on the specific structure of machine learning objective functions.

Typically, the cost function can be written as an average over the training set,
such as

J(θ) = E_{(x,y)~p̂_data} L(f(x; θ), y),    (8.1)

where L is the per-example loss function, f(x; θ) is the predicted output when the input is x, and p̂_data is the empirical distribution. In the supervised learning case, y is the target output. Throughout this chapter, we develop the unregularized supervised case, where the arguments to L are f(x; θ) and y. However, it is trivial to extend this development, for example, to include θ or x as arguments, or to exclude y as an argument, in order to develop various forms of regularization or unsupervised learning.

Eq. 8.1 defines an objective function with respect to the training set. We would usually prefer to minimize the corresponding objective function where the expectation is taken across the data generating distribution p_data rather than just over the finite training set:

J*(θ) = E_{(x,y)~p_data} L(f(x; θ), y).    (8.2)
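Eq. 8.1 says the training objective is just the average of a per-example loss over the training set. A NumPy sketch with a linear model and squared-error loss (both chosen purely for illustration):

```python
import numpy as np

def cost(theta, X, y, per_example_loss):
    """J(theta) = E_{(x,y) ~ p_hat_data} L(f(x; theta), y): the mean of
    the per-example loss over the empirical distribution (Eq. 8.1)."""
    predictions = X @ theta                 # f(x; theta) for every example
    return float(np.mean(per_example_loss(predictions, y)))

squared_error = lambda y_hat, y: (y_hat - y) ** 2

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])                # fits all three examples exactly
print(cost(theta, X, y, squared_error))     # prints 0.0
```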
8.1.1 Empirical Risk Minimization

The goal of a machine learning algorithm is to reduce the expected generalization error given by Eq. 8.2. This quantity is known as the risk. We emphasize here that the expectation is taken over the true underlying distribution p_data. If we knew the true distribution p_data(x, y), risk minimization would be an optimization task
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
solvable by an optimization algorithm. However, when we do not know p_data(x, y) but only have a training set of samples, we have a machine learning problem.

The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. This means replacing the true distribution p(x, y) with the empirical distribution p̂(x, y) defined by the training set. We now minimize the empirical risk

    E_{x,y∼p̂_data(x,y)}[L(f(x; θ), y)] = (1/m) Σ_{i=1}^{m} L(f(x^(i); θ), y^(i)),    (8.3)

where m is the number of training examples.

The training process based on minimizing this average training error is known as empirical risk minimization. In this setting, machine learning is still very similar to straightforward optimization. Rather than optimizing the risk directly, we optimize the empirical risk, and hope that the risk decreases significantly as well. A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.

However, empirical risk minimization is prone to overfitting. Models with high capacity can simply memorize the training set. In many cases, empirical risk minimization is not really feasible. The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere). These two problems mean that, in the context of deep learning, we rarely use empirical risk minimization. Instead, we must use a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.
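The claim that the 0-1 loss has no useful derivatives can be checked numerically. In this sketch (a hypothetical linear classifier on synthetic data, used only for illustration), a tiny parameter perturbation leaves the 0-1 loss unchanged, so its finite-difference slope is zero almost everywhere, while a smooth surrogate such as the logistic loss responds:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # labels in {0, 1}
w = np.array([0.5, -0.2])                   # hypothetical weight vector

def zero_one_loss(w):
    preds = (X @ w > 0).astype(float)
    return np.mean(preds != y)

def logistic_loss(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    eps = 1e-12  # guard against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

h = 1e-6
# A step this small almost never flips any prediction, so the
# finite-difference "gradient" of the 0-1 loss is exactly zero...
g01 = (zero_one_loss(w + np.array([h, 0.0])) - zero_one_loss(w)) / h
# ...while the smooth surrogate provides a usable slope to descend.
glog = (logistic_loss(w + np.array([h, 0.0])) - logistic_loss(w)) / h
```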
8.1.2
Surrogate Loss Functions and Early Stopping
Sometimes, the loss function we actually care about (say, classification error) is not one that can be optimized efficiently. For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a linear classifier (Marcotte and Savard, 1992). In such situations, one typically optimizes a surrogate loss function instead, which acts as a proxy but has advantages. For example, the negative log-likelihood of the correct class is typically used as a surrogate for the 0-1 loss. The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation.
In some cases, a surrogate loss function actually results in being able to learn more. For example, the test set 0-1 loss often continues to decrease for a long time after the training set 0-1 loss has reached zero, when training using the log-likelihood surrogate. This is because even when the expected 0-1 loss is zero, one can improve the robustness of the classifier by further pushing the classes apart from each other, obtaining a more confident and reliable classifier, thus extracting more information from the training data than would have been possible by simply minimizing the average 0-1 loss on the training set.

A very important difference between optimization in general and optimization as we use it for training algorithms is that training algorithms do not usually halt at a local minimum. Instead, a machine learning algorithm usually minimizes a surrogate loss function but halts when a convergence criterion based on early stopping (Sec. 7.8) is satisfied. Typically the early stopping criterion is based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur. Training often halts while the surrogate loss function still has large derivatives, which is very different from the pure optimization setting, where an optimization algorithm is considered to have converged when the gradient becomes very small.
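This halting rule can be sketched as a short training loop. Everything here (the logistic model, learning rate, and patience threshold) is a hypothetical illustration: the loop descends a smooth surrogate but stops when validation 0-1 error stops improving, regardless of the gradient norm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical, roughly linearly separable binary data.
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
Xtr, ytr, Xva, yva = X[:200], y[:200], X[200:], y[200:]

def val_error(w):
    """The true underlying loss: 0-1 error on the validation set."""
    return np.mean(((Xva @ w) > 0).astype(float) != yva)

w = np.zeros(2)
lr, patience = 0.1, 5          # hypothetical hyperparameters
best_err, best_w, bad_steps = np.inf, w.copy(), 0
for step in range(1000):
    p = 1.0 / (1.0 + np.exp(-(Xtr @ w)))   # gradient of the logistic surrogate
    grad = Xtr.T @ (p - ytr) / len(ytr)
    w -= lr * grad
    err = val_error(w)
    if err < best_err:
        best_err, best_w, bad_steps = err, w.copy(), 0
    else:
        bad_steps += 1
        if bad_steps >= patience:          # halt on the overfitting signal,
            break                          # not on a small gradient norm
```

Note that the loop keeps the parameters with the best validation error seen so far, which is the usual early-stopping convention.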
8.1.3
Batch and Minibatch Algorithms
One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples. Optimization algorithms for machine learning typically compute each update to the parameters based on an expected value of the cost function estimated using only a subset of the terms of the full cost function.

For example, maximum likelihood estimation problems, when viewed in log space, decompose into a sum over each example:

    θ_ML = argmax_θ Σ_{i=1}^{m} log p_model(x^(i), y^(i); θ).    (8.4)

Maximizing this sum is equivalent to maximizing the expectation over the empirical distribution defined by the training set:

    J(θ) = E_{x,y∼p̂_data} log p_model(x, y; θ).    (8.5)

Most of the properties of the objective function J used by most of our optimization algorithms are also expectations over the training set. For example, the
most commonly used property is the gradient:

    ∇_θ J(θ) = E_{x,y∼p̂_data} ∇_θ log p_model(x, y; θ).    (8.6)

Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset. In practice, we can compute these expectations by randomly sampling a small number of examples from the dataset, then taking the average over only those examples.

Recall that the standard error of the mean (Eq. 5.46) estimated from n samples is given by σ/√n, where σ is the true standard deviation of the value of the samples. The denominator √n shows that there are less than linear returns to using more examples to estimate the gradient. Compare two hypothetical estimates of the gradient, one based on 100 examples and another based on 10,000 examples. The latter requires 100 times more computation than the former, but reduces the standard error of the mean only by a factor of 10. Most optimization algorithms converge much faster (in terms of total computation, not in terms of number of updates) if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.

Another consideration motivating statistical estimation of the gradient from a small number of samples is redundancy in the training set. In the worst case, all m samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could compute the correct gradient with a single sample, using m times less computation than the naive approach. In practice, we are unlikely to truly encounter this worst-case situation, but we may find large numbers of examples that all make very similar contributions to the gradient.

Optimization algorithms that use the entire training set are called batch or deterministic gradient methods, because they process all of the training examples simultaneously in a large batch. This terminology can be somewhat confusing because the word "batch" is also often used to describe the minibatch used by minibatch stochastic gradient descent. Typically the term "batch gradient descent" implies the use of the full training set, while the use of the term "batch" to describe a group of examples does not. For example, it is very common to use the term "batch size" to describe the size of a minibatch.

Optimization algorithms that use only a single example at a time are sometimes called stochastic or sometimes online methods. The term online is usually reserved for the case where the examples are drawn from a stream of continually created examples rather than from a fixed-size training set over which several passes are made.

Most algorithms used for deep learning fall somewhere in between, using more
than one but less than all of the training examples. These were traditionally called minibatch or minibatch stochastic methods and it is now common to simply call them stochastic methods.

The canonical example of a stochastic method is stochastic gradient descent, presented in detail in Sec. 8.3.1.

Minibatch sizes are generally driven by the following factors:

• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.

• Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.

• If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.

• Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.

• Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability due to the high variance in the estimate of the gradient. The total runtime can be very high due to the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

Different kinds of algorithms use different kinds of information from the minibatch in different ways.
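The earlier point about less-than-linear returns (the σ/√n standard error) is easy to verify empirically. This sketch uses a made-up population of scalar per-example "gradients" (a stand-in for one coordinate of g, purely for illustration); growing the batch from 100 to 10,000 examples costs 100× more computation but shrinks the standard error only about 10×:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical per-example gradient values with true mean 0 and std 1.
population = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

def gradient_estimate(n):
    """Average of n sampled per-example gradients."""
    idx = rng.integers(0, len(population), size=n)
    return population[idx].mean()

def standard_error(n, trials=2000):
    """Empirical standard error of the n-sample estimate."""
    return np.std([gradient_estimate(n) for _ in range(trials)])

se_100 = standard_error(100)
se_10000 = standard_error(10_000)
ratio = se_100 / se_10000   # theory predicts about sqrt(10000 / 100) = 10
```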
Some algorithms are more sensitive to sampling error than others, either because they use information that is difficult to estimate accurately with few samples, or because they use information in ways that amplify sampling errors more. Methods that compute updates based only on the gradient g are usually relatively robust and can handle smaller batch sizes like 100. Second-order methods, which also use the Hessian matrix H and compute updates such as H⁻¹g, typically require much larger batch sizes like 10,000. These large batch sizes are required to minimize fluctuations in the estimates of H⁻¹g. Suppose that H is estimated perfectly but has a poor condition number. Multiplication by
H or its inverse amplifies pre-existing errors, in this case, estimation errors in g. Very small changes in the estimate of g can thus cause large changes in the update H⁻¹g, even if H were estimated perfectly. Of course, H will be estimated only approximately, so the update H⁻¹g will contain even more error than we would predict from applying a poorly conditioned operation to the estimate of g.

It is also crucial that the minibatches be selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires that those samples be independent. We also wish for two subsequent gradient estimates to be independent from each other, so two subsequent minibatches of examples should also be independent from each other. Many datasets are most naturally arranged in a way where successive examples are highly correlated. For example, we might have a dataset of medical data with a long list of blood sample test results. This list might be arranged so that first we have five blood samples taken at different times from the first patient, then we have three blood samples taken from the second patient, then the blood samples from the third patient, and so on. If we were to draw examples in order from this list, then each of our minibatches would be extremely biased, because it would represent primarily one patient out of the many patients in the dataset. In cases such as these where the order of the dataset holds some significance, it is necessary to shuffle the examples before selecting minibatches. For very large datasets, for example datasets containing billions of examples in a data center, it can be impractical to sample examples truly uniformly at random every time we want to construct a minibatch. Fortunately, in practice it is usually sufficient to shuffle the order of the dataset once and then store it in shuffled fashion. This will impose a fixed set of possible minibatches of consecutive examples that all models trained thereafter will use, and each individual model will be forced to reuse this ordering every time it passes through the training data. However, this deviation from true random selection does not seem to have a significant detrimental effect. Failing to ever shuffle the examples in any way can seriously reduce the effectiveness of the algorithm.

Many optimization problems in machine learning decompose over examples well enough that we can compute entire separate updates over different examples in parallel. In other words, we can compute the update that minimizes J(X) for one minibatch of examples X at the same time that we compute the update for several other minibatches. Such asynchronous parallel distributed approaches are discussed further in Sec. 12.1.3.

An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error (Eq. 8.2) so long as no examples are repeated. Most implementations of minibatch stochastic gradient
descent shuffle the dataset once and then pass through it multiple times. On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error. On the second pass, the estimate becomes biased because it is formed by re-sampling values that have already been used, rather than obtaining new fair samples from the data generating distribution.

The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning case, where examples or minibatches are drawn from a stream of data. In other words, instead of receiving a fixed-size training set, the learner is similar to a living being who sees a new example at each instant, with every example (x, y) coming from the data generating distribution p_data(x, y). In this scenario, examples are never repeated; every experience is a fair sample from p_data.

The equivalence is easiest to derive when both x and y are discrete. In this case, the generalization error (Eq. 8.2) can be written as a sum

    J*(θ) = Σ_x Σ_y p_data(x, y) L(f(x; θ), y),    (8.7)

with the exact gradient
XX with the exact gradient p data(x, y)∇x L(f (x; θ), y). (8.8) g = ∇θ J ∗ (θ) = XxXy J (θ) = p (8.8) g= (x, y) L(f (x; θ), y). We hav havee already seen log-likelihood d in Eq. 8.5 ∇ the same fact demonstrated∇for the log-likelihoo and Eq. 8.6; we observ observee no now w that this holds for other functions L besides the Weeliho havo e d. already seen result the same demonstrated theylog-likelihoo d in Eq. 8.5 lik likeliho elihoo A similar canfact be derived when xforand are contin continuous, uous, under and Eq. 8.6; we observ e nopw that this holds for other functions L besides the XX mild assumptions regarding data and L. likelihood. A similar result can be derived when x and y are continuous, under Hence, w wee can obtain an un unbiased biased estimator of the exact gradient of the mild assumptions regarding p and L. generalization error by sampling a minibatc minibatch h of examples {x(1) , . . . x(m)} with corHence, w e can obtain an un biased estimator of the exact of the ( i ) pdata,gradient resp responding onding targets y from the data generating distribution and computing generalization by sampling aect minibatc of examplesforxthat with cor, . .minibatch: .x the gradient oferror the loss with resp respect to thehparameters responding targets y from the data generating distribution , and }computing { p X 1 ect the gradient of the loss with resp to the parameters for that minibatch: ˆ = ∇θ g L(f (x(i) ; θ), y (i)). (8.9) m i 1 ˆ= g L(f (x ; θ), y ). (8.9) Up Updating dating θ in the direction ofmgˆ∇performs SGD on the generalization error. Of course, thisdirection interpretation only applies when examples are error. not reused. Updating θ in the of gˆ performs SGD on the generalization Nonetheless, it is usually best to make sev several eral passes through the training set, X Of course, this interpretation only applies examples are notarereused. unless the training set is extremely large. 
When multiple such epochs are used, only the first epoch follows the unbiased
gradient of the generalization error.
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
The additional epochs, of course, usually provide enough benefit due to decreased
training error to offset the harm they cause by increasing the gap between training
error and test error.

With some datasets growing rapidly in size, faster than computing power, it
is becoming more common for machine learning applications to use each training
example only once or even to make an incomplete pass through the training
set. When using an extremely large training set, overfitting is not an issue, so
underfitting and computational efficiency become the predominant concerns. See
also Bottou and Bousquet (2008) for a discussion of the effect of computational
bottlenecks on generalization error, as the number of training examples grows.
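The unbiasedness of the minibatch estimator in Eq. 8.9 is easy to check numerically. The sketch below uses a hypothetical linear least-squares problem (illustrative data, not an example from this chapter): averaging many independent minibatch gradients recovers the gradient over the full dataset.

```python
import numpy as np

# Illustrative linear least-squares problem (hypothetical data).
rng = np.random.default_rng(0)
n, d, m = 1000, 5, 10            # dataset size, parameter dimension, minibatch size
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
theta = rng.normal(size=d)

def grad(Xb, yb, theta):
    # Gradient of the average loss 0.5 * mean((x^T theta - y)^2) over a batch.
    return Xb.T @ (Xb @ theta - yb) / len(yb)

full_g = grad(X, y, theta)       # exact gradient over the whole training set

# Average many independent minibatch gradient estimates (analogue of Eq. 8.9).
trials = 5000
est = np.zeros(d)
for _ in range(trials):
    idx = rng.choice(n, size=m, replace=False)
    est += grad(X[idx], y[idx], theta)
est /= trials

# The average approaches the exact gradient: each minibatch gradient is unbiased.
print(np.max(np.abs(est - full_g)))
```

Strictly speaking, sampling minibatches from a finite training set gives an unbiased estimate of the training-set gradient; sampling fresh examples from p_data gives an unbiased estimate of the generalization-error gradient, as in the text.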
8.2 Challenges in Neural Network Optimization
Optimization in general is an extremely difficult task. Traditionally, machine
learning has avoided the difficulty of general optimization by carefully designing
the objective function and constraints to ensure that the optimization problem is
convex. When training neural networks, we must confront the general non-convex
case. Even convex optimization is not without its complications. In this section,
we summarize several of the most prominent challenges involved in optimization
for training deep models.
8.2.1 Ill-Conditioning
Some challenges arise even when optimizing convex functions. Of these, the most
prominent is ill-conditioning of the Hessian matrix H. This is a very general
problem in most numerical optimization, convex or otherwise, and is described in
more detail in Sec. 4.3.1.

The ill-conditioning problem is generally believed to be present in neural
network training problems. Ill-conditioning can manifest by causing SGD to get
"stuck" in the sense that even very small steps increase the cost function.

Recall from Eq. 4.9 that a second-order Taylor series expansion of the cost
function predicts that a gradient descent step of -\epsilon g will add

    \frac{1}{2} \epsilon^2 g^\top H g - \epsilon g^\top g    (8.10)

to the cost. Ill-conditioning of the gradient becomes a problem when \frac{1}{2} \epsilon^2 g^\top H g
exceeds \epsilon g^\top g. To determine whether ill-conditioning is detrimental to a neural
network training task, one can monitor the squared gradient norm g^\top g and the
Figure 8.1: Gradient descent often does not arrive at a critical point of any kind. In
this example, the gradient norm increases throughout training of a convolutional network
used for object detection. (Left) A scatterplot showing how the norms of individual
gradient evaluations are distributed over time. To improve legibility, only one gradient
norm is plotted per epoch. The running average of all gradient norms is plotted as a solid
curve. The gradient norm clearly increases over time, rather than decreasing as we would
expect if the training process converged to a critical point. (Right) Despite the increasing
gradient, the training process is reasonably successful. The validation set classification
error decreases to a low level.
g^\top H g term. In many cases, the gradient norm does not shrink significantly
throughout learning, but the g^\top H g term grows by more than an order of magnitude.
The result is that learning becomes very slow despite the presence of a strong
gradient because the learning rate must be shrunk to compensate for even stronger
curvature. Fig. 8.1 shows an example of the gradient increasing significantly during
the successful training of a neural network.

Though ill-conditioning is present in other settings besides neural network
training, some of the techniques used to combat it in other contexts are less
applicable to neural networks. For example, Newton's method is an excellent tool
for minimizing convex functions with poorly conditioned Hessian matrices, but in
the subsequent sections we will argue that Newton's method requires significant
modification before it can be applied to neural networks.
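The competition between the two terms of Eq. 8.10 can be checked on a small quadratic. The sketch below assumes an illustrative ill-conditioned Hessian H = diag(1, 1000); because the cost is exactly quadratic, the Taylor prediction matches the true change in cost, and the curvature term overtakes the descent term as the step size grows.

```python
import numpy as np

H = np.diag([1.0, 1000.0])   # illustrative ill-conditioned Hessian (condition number 1000)
x = np.array([1.0, 0.1])     # current parameters
g = H @ x                    # gradient of f(x) = 0.5 * x^T H x

def cost(x):
    return 0.5 * x @ H @ x

def predicted_change(eps):
    # Second-order Taylor prediction (Eq. 8.10) for a step of -eps * g.
    return 0.5 * eps**2 * (g @ H @ g) - eps * (g @ g)

for eps in (0.0005, 0.002, 0.005):
    actual = cost(x - eps * g) - cost(x)
    # The prediction is exact here because the cost is quadratic. For the
    # largest step, the curvature term dominates and the cost increases.
    print(eps, predicted_change(eps), actual)
```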
8.2.2 Local Minima
One of the most prominent features of a convex optimization problem is that it
can be reduced to the problem of finding a local minimum. Any local minimum is
guaranteed to be a global minimum. Some convex functions have a flat region at
the bottom rather than a single global minimum point, but any point within such
a flat region is an acceptable solution. When optimizing a convex function, we
know that we have reached a good solution if we find a critical point of any kind.

With non-convex functions, such as neural nets, it is possible to have many
local minima. Indeed, nearly any deep model is essentially guaranteed to have
an extremely large number of local minima. However, as we will see, this is not
necessarily a major problem.

Neural networks and any models with multiple equivalently parametrized latent
variables all have multiple local minima because of the model identifiability problem.
A model is said to be identifiable if a sufficiently large training set can rule out all
but one setting of the model's parameters.
Models with latent variables are often
not identifiable because we can obtain equivalent models by exchanging latent
variables with each other. For example, we could take a neural network and modify
layer 1 by swapping the incoming weight vector for unit i with the incoming weight
vector for unit j, then doing the same for the outgoing weight vectors. If we have
m layers with n units each, then there are n!^m ways of arranging the hidden units.
This kind of non-identifiability is known as weight space symmetry.

In addition to weight space symmetry, many kinds of neural networks have
additional causes of non-identifiability. For example, in any rectified linear or
maxout network, we can scale all of the incoming weights and biases of a unit by
α if we also scale all of its outgoing weights by \frac{1}{α}.
This means that, if the cost
function does not include terms such as weight decay that depend directly on the
weights rather than the models' outputs, every local minimum of a rectified linear
or maxout network lies on an (m × n)-dimensional hyperbola of equivalent local
minima.

These model identifiability issues mean that there can be an extremely large
or even uncountably infinite amount of local minima in a neural network cost
function. However, all of these local minima arising from non-identifiability are
equivalent to each other in cost function value. As a result, these local minima are
not a problematic form of non-convexity.

Local minima can be problematic if they have high cost in comparison to the
global minimum.
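The rescaling symmetry described above is easy to verify numerically. The following sketch (an illustrative two-layer ReLU network with made-up dimensions) scales the incoming weights and bias of every hidden unit by α and the outgoing weights by 1/α, leaving the network's output unchanged, since relu(αz) = α relu(z) for α > 0.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # incoming weights of the hidden layer (4 units, 3 inputs)
b1 = rng.normal(size=4)
W2 = rng.normal(size=(2, 4))   # outgoing weights

def relu_net(x, W1, b1, W2):
    return W2 @ np.maximum(0.0, W1 @ x + b1)

x = rng.normal(size=3)
alpha = 3.7

# Scale incoming weights and bias by alpha, outgoing weights by 1/alpha:
# the function computed by the network is unchanged.
y_orig = relu_net(x, W1, b1, W2)
y_scaled = relu_net(x, alpha * W1, alpha * b1, W2 / alpha)

print(np.allclose(y_orig, y_scaled))  # True
```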
One can construct small neural networks, even without hidden
units, that have local minima with higher cost than the global minimum (Sontag
and Sussman, 1989; Brady et al., 1989; Gori and Tesi, 1992). If local minima
with high cost are common, this could pose a serious problem for gradient-based
optimization algorithms.

It remains an open question whether there are many local minima of high cost
for networks of practical interest and whether optimization algorithms encounter
them. For many years, most practitioners believed that local minima were a
common problem plaguing neural network optimization. Today, that does not
appear to be the case. The problem remains an active area of research, but experts
now suspect that, for sufficiently large neural networks, most local minima have a
low cost function value, and that it is not important to find a true global minimum
rather than to find a point in parameter space that has low but not minimal cost
(Saxe et al., 2013; Dauphin et al., 2014; Goodfellow et al., 2015; Choromanska
et al., 2014).

Many practitioners attribute nearly all difficulty with neural network optimiza-
tion to local minima. We encourage practitioners to carefully test for specific
problems. A test that can rule out local minima as the problem is to plot the
norm of the gradient over time.
If the norm of the gradient does not shrink to
insignificant size, the problem is neither local minima nor any other kind of critical
point. This kind of negative test can rule out local minima. In high dimensional
spaces, it can be very difficult to positively establish that local minima are the
problem. Many structures other than local minima also have small gradients.
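This diagnostic costs only one norm computation per step. A minimal sketch on a toy convex problem (illustrative, not from the text): record the gradient norm at every gradient descent step. Here the norm shrinks, which is consistent with approaching a critical point; a norm that stayed large would rule critical points out.

```python
import numpy as np

# Toy convex problem (illustrative): minimize 0.5 * ||A theta - b||^2 by
# gradient descent, recording the gradient norm at every step.
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
theta = rng.normal(size=5)

grad_norms = []
for step in range(500):
    g = A.T @ (A @ theta - b)
    grad_norms.append(np.linalg.norm(g))
    theta -= 0.01 * g

# The norm shrinks toward zero here. If the recorded norms stayed large,
# critical points (local minima included) could be ruled out as the obstacle.
print(grad_norms[0], grad_norms[-1])
```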
8.2.3 Plateaus, Saddle Points and Other Flat Regions
For many high-dimensional non-convex functions, local minima (and maxima)
are in fact rare compared to another kind of point with zero gradient: a saddle
point. Some points around a saddle point have greater cost than the saddle point,
while others have a lower cost. At a saddle point, the Hessian matrix has both
positive and negative eigenvalues. Points lying along eigenvectors associated with
positive eigenvalues have greater cost than the saddle point, while points lying
along eigenvectors associated with negative eigenvalues have lower value. We can
think of a saddle point as being a local minimum along one cross-section of the
cost function and a local maximum along another cross-section. See Fig. 4.5 for
an illustration.

Many classes of random functions exhibit the following behavior: in low-
dimensional spaces, local minima are common. In higher dimensional spaces, local
or a function of dimensional spaces, lo cal minima are common. In higher dimensional spaces, lo cal this type, the exp expected ected ratio of the num numb ber of saddle poin oints ts to lo local cal minima grows R R f : minima are rare and saddle p oints are more common. F or a function of exp exponen onen onentially tially with n. To understand the intuition behind this beha ehavior, vior, observe this type, the exp ected ratio of the num b er of saddle p oin ts to lo cal minima grows → The that the Hessian matrix at a lo local cal minimum has only positiv ositivee eigen eigenv values. exp onen tially with . T o understand the intuition b ehind this b eha vior, observe n Hessian matrix at a saddle poin ointt has a mixture of positive and negativ negativee eigenv eigenvalues. alues. that the that Hessian matrix athaeigenv local alue minimum has only ositive aeigen The Imagine the sign of eac each eigenvalue is generated by p flipping coin.values. In a single Hessian matrix a saddle poinat lo has mixture of ositive and negativ e eigenvheads alues. dimension, it is at easy to obtain local calaminimum byptossing a coin and getting Imagine that the sign of eac h eigenv alue is generated by flipping a coin. In a single once. In n-dimensional space, it is exp exponentially onentially unlikely that all n coin tosses will dimension, it is easy to obtain a local minimum by tossing a coin and getting heads 285 once. In n-dimensional space, it is exponentially unlikely that all n coin tosses will
be heads. See Dauphin et al. (2014) for a review of the relevant theoretical work.

An amazing property of many random functions is that the eigenvalues of the
Hessian become more likely to be positive as we reach regions of lower cost. In
our coin tossing analogy, this means we are more likely to have our coin come up
heads n times if we are at a critical point with low cost. This means that local
minima are much more likely to have low cost than high cost. Critical points with
high cost are far more likely to be saddle points. Critical points with extremely
high cost are more likely to be local maxima.

This happens for many classes of random functions. Does it happen for neural
networks?
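The coin-tossing argument above can be simulated directly. Under the illustrative assumption that the sign of each of the n Hessian eigenvalues at a critical point is an independent fair coin flip, the fraction of critical points that are minima decays as (1/2)^n:

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 200000

frac_minima = {}
for n in (1, 2, 5, 10):
    # Sign of each of the n Hessian eigenvalues at a hypothetical critical
    # point, modeled as an independent fair coin flip.
    signs = rng.choice([-1, 1], size=(trials, n))
    frac_minima[n] = np.mean(np.all(signs > 0, axis=1))
    print(n, frac_minima[n])   # close to (1/2)**n: minima become rare as n grows
```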
Baldi and Hornik (1989) showed theoretically that shallow autoencoders
(feedforward networks trained to copy their input to their output, described in
Chapter 14) with no nonlinearities have global minima and saddle points but no
local minima with higher cost than the global minimum. They observed without
proof that these results extend to deeper networks without nonlinearities. The
output of such networks is a linear function of their input, but they are useful
to study as a model of nonlinear neural networks because their loss function is
a non-convex function of their parameters. Such networks are essentially just
multiple matrices composed together. Saxe et al. (2013) provided exact solutions
to the complete learning dynamics in such networks and showed that learning in
these models captures many of the qualitative features observed in the training of
deep models with nonlinear activation functions.
Dauphin et al. (2014) showed
experimentally that real neural networks also have loss functions that contain very
many high-cost saddle points. Choromanska et al. (2014) provided additional
theoretical arguments, showing that another class of high-dimensional random
functions related to neural networks does so as well.

What are the implications of the proliferation of saddle points for training algo-
rithms? For first-order optimization algorithms that use only gradient information,
the situation is unclear. The gradient can often become very small near a saddle
point. On the other hand, gradient descent empirically seems to be able to escape
saddle points in many cases. Goodfellow et al. (2015) provided visualizations of
several learning trajectories of state-of-the-art neural networks, with an example
given in Fig.
8.2. These visualizations show a flattening of the cost function near
a prominent saddle point where the weights are all zero, but they also show the
gradient descent trajectory rapidly escaping this region. Goodfellow et al. (2015)
also argue that continuous-time gradient descent may be shown analytically to be
repelled from, rather than attracted to, a nearby saddle point, but the situation
may be different for more realistic uses of gradient descent.

For Newton's method, it is clear that saddle points constitute a problem.
Figure 8.2: A visualization of the cost function of a neural network, plotting J(θ) over two
projections of θ. Image adapted with permission from Goodfellow et al. (2015). These
visualizations appear similar for feedforward neural networks, convolutional networks,
and recurrent networks applied to real object recognition and natural language processing
tasks. Surprisingly, these visualizations usually do not show many conspicuous obstacles.
Prior to the success of stochastic gradient descent for training very large models beginning
in roughly 2012, neural net cost function surfaces were generally believed to have much
more non-convex structure than is revealed by these projections. The primary obstacle
revealed by this projection is a saddle point of high cost near where the parameters are
initialized, but, as indicated by the blue path, the SGD training trajectory escapes this
saddle point readily. Most of training time is spent traversing the relatively flat valley of
the cost function, which may be due to high noise in the gradient, poor conditioning of
the Hessian matrix in this region, or simply the need to circumnavigate the tall "mountain"
visible in the figure via an indirect arcing path.
Gradient descent is designed to move "downhill" and is not explicitly designed
to seek a critical point. Newton's method, however, is designed to solve for a
point where the gradient is zero. Without appropriate modification, it can jump
to a saddle point. The proliferation of saddle points in high dimensional spaces
presumably explains why second-order methods have not succeeded in replacing
gradient descent for neural network training. Dauphin et al. (2014) introduced
a saddle-free Newton method for second-order optimization and showed that it
improves significantly over the traditional version. Second-order methods remain
difficult to scale to large neural networks, but this saddle-free approach holds
promise if it could be scaled.
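Newton's attraction to saddle points is visible on the simplest saddle, f(x, y) = x² − y² (a toy example, not the saddle-free method itself). Because f is quadratic, a single Newton step lands exactly on the saddle at the origin, while gradient descent moves away from it along the y axis:

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])   # gradient of f(x, y) = x^2 - y^2

H = np.diag([2.0, -2.0])               # Hessian (constant for this quadratic)

p = np.array([0.5, 0.3])

# One Newton step lands exactly on the saddle point at the origin.
newton = p - np.linalg.solve(H, grad(p))

# Gradient descent instead escapes: the y coordinate grows each step.
gd = p.copy()
for _ in range(50):
    gd = gd - 0.1 * grad(gd)

print(newton)        # [0. 0.]
print(abs(gd[1]))    # large: the trajectory has moved far from the saddle
```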
There are also maxima, whic which h are muc much h lik likee saddle poin oints ts from the There are other kinds of p oints with zero gradient besides minima saddle persp erspectiv ectiv ectivee of optimization—many algorithms are not attracted to and them, but p oints.dified There are alsomethod maxima, h are bmuc h lik e saddle poinrare ts from the unmo unmodified Newton’s is. whic Maxima ecome exp exponentially onentially in high perspective of optimization—many algorithms are not attracted to them, but dimensional space, just like minima do. unmodified Newton’s method is. Maxima become exponentially rare in high There may also be wide, flat regions of constant value. In these lo locations, cations, the dimensional space, just like minima do. gradien gradientt and also the Hessian are all zero. Suc Such h degenerate lo locations cations pose ma major jor There for mayallalso be wide,optimization flat regions algorithms. of constant vIn alue. In ex these locations, the problems numerical a conv convex problem, a wide, gradien t and alsoconsist the Hessian areofall zero. minima, Such degenerate locationsoptimization pose ma jor flat region must en entirely tirely global but in a general problems such for alla numerical optimization In aofconv problem, a wide, problem, region could corresp correspond ond algorithms. to a high value theexob objectiv jectiv jectivee function. flat region must consist entirely of global minima, but in a general optimization problem, such a region could correspond to a high value of the ob jective function.
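To make the contrast concrete, here is a small illustrative sketch (not from the book) using f(x, y) = x² − y², whose only critical point is a saddle at the origin: an unmodified Newton step solves for the zero-gradient point and jumps directly onto the saddle, while gradient descent moves downhill and escapes along the y direction.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a single critical point at the origin: a saddle.
grad = lambda p: np.array([2 * p[0], -2 * p[1]])
hess = np.array([[2.0, 0.0], [0.0, -2.0]])

p_newton = np.array([1.0, 1e-3])
p_gd = p_newton.copy()

for _ in range(10):
    p_newton = p_newton - np.linalg.solve(hess, grad(p_newton))  # Newton step
    p_gd = p_gd - 0.1 * grad(p_gd)                               # gradient descent step

print(p_newton)  # [0. 0.] -- Newton jumps straight to the saddle and stays there
print(p_gd)      # the y-coordinate has grown: gradient descent escapes downhill
```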
8.2.4
Cliffs and Exploding Gradients
Neural networks with many layers often have extremely steep regions resembling cliffs, as illustrated in Fig. 8.3. These result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off of the cliff structure altogether.
Figure 8.3: The objective function for highly nonlinear deep neural networks or for recurrent neural networks often contains sharp nonlinearities in parameter space resulting from the multiplication of several parameters. These nonlinearities give rise to very high derivatives in some places. When the parameters get close to such a cliff region, a gradient descent update can catapult the parameters very far, possibly losing most of the optimization work that had been done. Figure adapted with permission from Pascanu et al. (2013a).
The cliff can be dangerous whether we approach it from above or from below, but fortunately its most serious consequences can be avoided using the gradient clipping heuristic described in Sec. 10.11.1. The basic idea is to recall that the gradient does not specify the optimal step size, but only the optimal direction within an infinitesimal region. When the traditional gradient descent algorithm proposes to make a very large step, the gradient clipping heuristic intervenes to reduce the step size to be small enough that it is less likely to go outside the region where the gradient indicates the direction of approximately steepest descent. Cliff structures are most common in the cost functions for recurrent neural networks, because such models involve a multiplication of many factors, with one factor for each time step. Long temporal sequences thus incur an extreme amount of multiplication.
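As a sketch of the idea (the threshold value below is arbitrary, not a recommendation), norm-based clipping rescales an oversized gradient down to a fixed norm while preserving its direction:

```python
import numpy as np

def clip_gradient(g, threshold):
    """Norm clipping: rescale g if its norm exceeds threshold, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([300.0, -400.0])              # a huge gradient, e.g. from a cliff region
clipped = clip_gradient(g, threshold=5.0)
print(clipped)                             # direction preserved; norm is now 5.0
```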
8.2.5
Long-Term Dependencies
Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs. So do recurrent networks, described in Chapter 10, which construct very deep computational graphs by
repeatedly applying the same operation at each time step of a long temporal sequence. Repeated application of the same parameters gives rise to especially pronounced difficulties.

For example, suppose that a computational graph contains a path that consists of repeatedly multiplying by a matrix W. After t steps, this is equivalent to multiplying by W^t. Suppose that W has an eigendecomposition W = V diag(λ) V^{−1}. In this simple case, it is straightforward to see that

    W^t = (V diag(λ) V^{−1})^t = V diag(λ)^t V^{−1}.    (8.11)
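Eq. 8.11 is easy to check numerically (a quick sketch with a randomly generated W; a real matrix may have complex eigenvalues, so we compare against the real part of the reconstruction):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
t = 5

# Eigendecomposition W = V diag(lam) V^{-1}
lam, V = np.linalg.eig(W)

Wt_direct = np.linalg.matrix_power(W, t)                 # W multiplied by itself t times
Wt_eig = (V @ np.diag(lam**t) @ np.linalg.inv(V)).real   # V diag(lam)^t V^{-1}

print(np.allclose(Wt_direct, Wt_eig))  # True
```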
Any eigenvalues λ_i that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude. The vanishing and exploding gradient problem refers to the fact that gradients through such a graph are also scaled according to diag(λ)^t. Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function, while exploding gradients can make learning unstable. The cliff structures described earlier that motivate gradient clipping are an example of the exploding gradient phenomenon.

The repeated multiplication by W at each time step described here is very similar to the power method algorithm used to find the largest eigenvalue of a matrix W and the corresponding eigenvector. From this point of view it is not surprising that x^⊤ W^t will eventually discard all components of x that are orthogonal to the
principal eigenvector of W.

Recurrent networks use the same matrix W at each time step, but feedforward networks do not, so even very deep feedforward networks can largely avoid the vanishing and exploding gradient problem (Sussillo, 2014).

We defer a further discussion of the challenges of training recurrent networks until Sec. 10.7, after recurrent networks have been described in more detail.
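The scaling by diag(λ)^t can be seen directly in a small sketch (the matrix below is constructed for illustration, with eigenvalues 1.1 and 0.9): the component of x along the |λ| > 1 eigenvector explodes, the other vanishes, and the direction of W^t x converges to the principal eigenvector, just as in the power method.

```python
import numpy as np

# Construct W = V diag(1.1, 0.9) V^{-1}; the principal eigenvector is [1, 0].
V = np.array([[1.0, 1.0],
              [0.0, 1.0]])
W = V @ np.diag([1.1, 0.9]) @ np.linalg.inv(V)

x = np.array([0.0, 1.0])   # starts with a component off the principal direction
for t in (10, 50, 100):
    xt = np.linalg.matrix_power(W, t) @ x
    direction = xt / np.linalg.norm(xt)
    print(t, np.linalg.norm(xt), direction)
# The norm grows roughly like 1.1^t, while the normalized direction
# approaches the principal eigenvector [1, 0] (up to sign).
```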
8.2.6
Inexact Gradients
Most optimization algorithms are primarily motivated by the case where we have exact knowledge of the gradient or Hessian matrix. In practice, we usually only have a noisy or even biased estimate of these quantities. Nearly every deep learning algorithm relies on sampling-based estimates at least insofar as using a minibatch of training examples to compute the gradient.

In other cases, the objective function we want to minimize is actually intractable. When the objective function is intractable, typically its gradient is intractable as well. In such cases we can only approximate the gradient. These issues mostly arise
with the more advanced models in Part III. For example, contrastive divergence gives a technique for approximating the gradient of the intractable log-likelihood of a Boltzmann machine.

Various neural network optimization algorithms are designed to account for imperfections in the gradient estimate. One can also avoid the problem by choosing a surrogate loss function that is easier to approximate than the true loss.
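A tiny sketch of the sampling-based case (with made-up data, not an example from the book): a minibatch gradient of a mean-squared-error loss is a noisy estimate of the full-batch gradient, and averaging many independent minibatch estimates recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10000, 5))
y = rng.standard_normal(10000)
w = np.zeros(5)

def mse_grad(Xb, yb, w):
    # Gradient of (1/n) * ||Xb @ w - yb||^2 with respect to w
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = mse_grad(X, y, w)                          # exact gradient on all examples
idx = rng.choice(len(y), size=64, replace=False)
mini = mse_grad(X[idx], y[idx], w)                # one noisy minibatch estimate

# Averaging many independent minibatch gradients shrinks the sampling noise:
avg = np.mean([mse_grad(X[i], y[i], w)
               for i in rng.choice(len(y), size=(200, 64))], axis=0)

print(np.linalg.norm(mini - full))  # noticeable sampling noise
print(np.linalg.norm(avg - full))   # much smaller
```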
8.2.7
Poor Correspondence between Local and Global Structure
Many of the problems we have discussed so far correspond to properties of the loss function at a single point—it can be difficult to make a single step if J(θ) is poorly conditioned at the current point θ, or if θ lies on a cliff, or if θ is a saddle point hiding the opportunity to make progress downhill from the gradient.

It is possible to overcome all of these problems at a single point and still perform poorly if the direction that results in the most improvement locally does not point toward distant regions of much lower cost.

Goodfellow et al. (2015) argue that much of the runtime of training is due to the length of the trajectory needed to arrive at the solution. Fig. 8.2 shows that the learning trajectory spends most of its time tracing out a wide arc around a mountain-shaped structure.

Much of research into the difficulties of optimization has focused on whether
training arrives at a global minimum, a local minimum, or a saddle point, but in practice neural networks do not arrive at a critical point of any kind. Fig. 8.1 shows that neural networks often do not arrive at a region of small gradient. Indeed, such critical points do not even necessarily exist. For example, the loss function −log p(y | x; θ) can lack a global minimum and instead asymptotically approach some value as the model becomes more confident. For a classifier with discrete y and p(y | x) provided by a softmax, the negative log-likelihood can become arbitrarily close to zero if the model is able to correctly classify every example in the training set, but it is impossible to actually reach the value of zero.
Likewise, a model of real values p(y | x) = N(y; f(θ), β^{−1}) can have negative log-likelihood that asymptotes to negative infinity—if f(θ) is able to correctly predict the value of all training set targets, the learning algorithm will increase β without bound. See Fig. 8.4 for an example of a failure of local optimization to find a good cost function value even in the absence of any local minima or saddle points.

Future research will need to develop further understanding of the factors that influence the length of the learning trajectory and better characterize the outcome
Figure 8.4: Optimization based on local downhill moves can fail if the local surface does not point toward the global solution. Here we provide an example of how this can occur, even if there are no saddle points and no local minima. This example cost function contains only asymptotes toward low values, not minima. The main cause of difficulty in this case is being initialized on the wrong side of the “mountain” and not being able to traverse it. In higher dimensional space, learning algorithms can often circumnavigate such mountains but the trajectory associated with doing so may be long and result in excessive training time, as illustrated in Fig. 8.2.
of the process.

Many existing research directions are aimed at finding good initial points for problems that have difficult global structure, rather than developing algorithms that use non-local moves.

Gradient descent and essentially all learning algorithms that are effective for training neural networks are based on making small, local moves. The previous sections have primarily focused on how the correct direction of these local moves can be difficult to compute. We may be able to compute some properties of the objective function, such as its gradient, only approximately, with bias or variance in our estimate of the correct direction. In these cases, local descent may or may not define a reasonably short path to a valid solution, but we are not actually able to follow the local descent path. The objective function may have issues
such as poor conditioning or discontinuous gradients, causing the region where the gradient provides a good model of the objective function to be very small. In these cases, local descent with steps of size ε may define a reasonably short path to the solution, but we are only able to compute the local descent direction with steps of size δ ≪ ε. In these cases, local descent may or may not define a path to the solution, but the path contains many steps, so following the path incurs a
high computational cost. Sometimes local information provides us no guide, when the function has a wide flat region, or if we manage to land exactly on a critical point (usually this latter scenario only happens to methods that explicitly solve for critical points, such as Newton’s method). In these cases, local descent does not define a path to a solution at all. In other cases, local moves can be too greedy and lead us along a path that moves downhill but away from any solution, as in Fig. 8.4, or along an unnecessarily long trajectory to the solution, as in Fig. 8.2.

Currently, we do not understand which of these problems are most relevant to making neural network optimization difficult, and this is an active area of research.
Regardless of which of these problems are most significant, all of them might be avoided if there exists a region of space connected reasonably directly to a solution by a path that local descent can follow, and if we are able to initialize learning within that well-behaved region. This last view suggests research into choosing good initial points for traditional optimization algorithms to use.
8.2.8
Theoretical Limits of Optimization
Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks (Blum and Rivest, 1992; Judd, 1989; Wolpert and MacReady, 1997). Typically these results have little bearing on the use of neural networks in practice.

Some theoretical results apply only to the case where the units of a neural network output discrete values. However, most neural network units output smoothly increasing values that make optimization via local search feasible. Some theoretical results show that there exist problem classes that are intractable, but it can be difficult to tell whether a particular problem falls into that class. Other results show that finding a solution for a network of a given size is intractable, but
in practice we can find a solution easily by using a larger network for which many more parameter settings correspond to an acceptable solution. Moreover, in the context of neural network training, we usually do not care about finding the exact minimum of a function, but only in reducing its value sufficiently to obtain good generalization error. Theoretical analysis of whether an optimization algorithm can accomplish this goal is extremely difficult. Developing more realistic bounds on the performance of optimization algorithms therefore remains an important goal for machine learning research.
8.3
Basic Algorithms
We have previously introduced the gradient descent (Sec. 4.3) algorithm that follows the gradient of an entire training set downhill. This may be accelerated considerably by using stochastic gradient descent to follow the gradient of randomly selected minibatches downhill, as discussed in Sec. 5.9 and Sec. 8.1.3.
8.3.1
Stochastic Gradient Descent
Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular. As discussed in Sec. 8.1.3, it is possible to obtain an unbiased estimate of the gradient by taking the average gradient on a minibatch of m examples drawn i.i.d. from the data generating distribution.

Algorithm 8.1 shows how to follow this estimate of the gradient downhill.

Algorithm 8.1 Stochastic gradient descent (SGD) update at training iteration k
Require: Learning rate ε_k.
Require: Initial parameter θ
while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: ĝ ← (1/m) ∇_θ ∑_i L(f(x^(i); θ), y^(i))
    Apply update: θ ← θ − ε ĝ
end while
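Algorithm 8.1 can be sketched in a few lines of NumPy (the least-squares problem, the fixed learning rate, and all hyperparameter values below are illustrative assumptions, not prescriptions from the text):

```python
import numpy as np

def sgd(grad_fn, theta, data, lr=0.01, batch_size=64, n_steps=1000, seed=0):
    """Algorithm 8.1: repeatedly sample a minibatch, estimate the gradient, step downhill."""
    rng = np.random.default_rng(seed)
    X, y = data
    for _ in range(n_steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        g_hat = grad_fn(theta, X[idx], y[idx])   # minibatch gradient estimate
        theta = theta - lr * g_hat               # apply update
    return theta

# Usage: noiseless least-squares regression; gradient of (1/m) * ||X @ theta - y||^2
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
grad_fn = lambda th, Xb, yb: 2.0 * Xb.T @ (Xb @ th - yb) / len(yb)

theta = sgd(grad_fn, np.zeros(3), (X, y), lr=0.05, n_steps=2000)
print(theta)  # approximately [1.0, -2.0, 0.5]
```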
A crucial parameter for the SGD algorithm is the learning rate. Previously, we have described SGD as using a fixed learning rate ε. In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration k as ε_k.

This is because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples) that does not vanish even when we arrive at a minimum. By comparison, the true gradient of the total cost function becomes small and then 0 when we approach and reach a minimum using batch gradient descent, so batch gradient descent can use a fixed learning rate. A sufficient condition to guarantee convergence of SGD is that

    ∑_{k=1}^{∞} ε_k = ∞,    (8.12)

and
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
∑_{k=1}^∞ ε_k² < ∞.    (8.13)
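As a concrete illustration (ours, not from the text), the classic schedule ε_k = ε_0/k satisfies both conditions: its partial sums grow without bound, while the partial sums of its squares converge. A quick numerical sketch:

```python
# Illustration (not from the text): the schedule eps_k = eps0 / k satisfies
# both conditions: the harmonic series (8.12) diverges, while the sum of
# its squares (8.13) converges (to eps0^2 * pi^2 / 6 for eps0 = 1).
import math

def eps(k, eps0=1.0):
    return eps0 / k

sum_eps = sum(eps(k) for k in range(1, 100001))          # partial sum of eps_k
sum_eps_sq = sum(eps(k) ** 2 for k in range(1, 100001))  # partial sum of eps_k^2

print(sum_eps)     # keeps growing without bound as more terms are added (~12.09 here)
print(sum_eps_sq)  # approaches pi^2 / 6 ≈ 1.6449
```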
In practice, it is common to decay the learning rate linearly until iteration τ:

ε_k = (1 − α) ε_0 + α ε_τ    (8.14)

with α = k/τ. After iteration τ, it is common to leave ε constant.

The learning rate may be chosen by trial and error, but it is usually best to choose it by monitoring learning curves that plot the objective function as a function of time. This is more of an art than a science, and most guidance on this subject should be regarded with some skepticism. When using the linear schedule, the parameters to choose are ε_0, ε_τ, and τ. Usually τ may be set to the number of iterations required to make a few hundred passes through the training set. Usually ε_τ should be set to roughly 1% the value of ε_0. The main question is how to set ε_0. If it is too large, the learning curve will show violent oscillations, with the cost function often increasing significantly.
Gentle oscillations are fine, especially if training with a stochastic cost function such as the cost function arising from the use of dropout. If the learning rate is too low, learning proceeds slowly, and if the initial learning rate is too low, learning may become stuck with a high cost value. Typically, the optimal initial learning rate, in terms of total training time and the final cost value, is higher than the learning rate that yields the best performance after the first 100 iterations or so. Therefore, it is usually best to monitor the first several iterations and use a learning rate that is higher than the best-performing learning rate at this time, but not so high that it causes severe instability.

The most important property of SGD and related minibatch or online gradient-based optimization is that computation time per update does not grow with the number of training examples. This allows convergence even when the number of training examples becomes very large.
For a large enough dataset, SGD may converge to within some fixed tolerance of its final test set error before it has processed the entire training set.

To study the convergence rate of an optimization algorithm it is common to measure the excess error J(θ) − min_θ J(θ), which is the amount that the current cost function exceeds the minimum possible cost. When SGD is applied to a convex problem, the excess error is O(1/√k) after k iterations, while in the strongly convex case it is O(1/k). These bounds cannot be improved unless extra conditions are assumed. Batch gradient descent enjoys better convergence rates than stochastic gradient descent in theory. However, the Cramér-Rao bound (Cramér, 1946; Rao, 1945) states that generalization error cannot decrease faster than O(1/k). Bottou
and Bousquet (2008) argue that it therefore may not be worthwhile to pursue an optimization algorithm that converges faster than O(1/k) for machine learning tasks—faster convergence presumably corresponds to overfitting. Moreover, the asymptotic analysis obscures many advantages that stochastic gradient descent has after a small number of steps. With large datasets, the ability of SGD to make rapid initial progress while evaluating the gradient for only very few examples outweighs its slow asymptotic convergence. Most of the algorithms described in the remainder of this chapter achieve benefits that matter in practice but are lost in the constant factors obscured by the O(1/k) asymptotic analysis. One can also trade off the benefits of both batch and stochastic gradient descent by gradually increasing the minibatch size during the course of learning.
For more information on SGD, see Bottou (1998).
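Putting the pieces of this section together, here is a minimal NumPy sketch of Algorithm 8.1 combined with the linear decay schedule of Eq. 8.14. The linear-regression loss, synthetic data, and hyperparameter values are illustrative choices, not prescriptions from the text:

```python
# A minimal sketch of SGD (Algorithm 8.1) with the linear learning rate
# decay of Eq. 8.14. The linear-regression loss, synthetic data, and
# hyperparameter values are illustrative, not from the text.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(1.0, 6.0)
y = X @ true_theta + 0.01 * rng.normal(size=1000)

def lr(k, eps0=0.1, tau=500):
    """Eq. 8.14: decay linearly until iteration tau, then hold eps_tau."""
    eps_tau = 0.01 * eps0          # heuristic from the text: eps_tau ~ 1% of eps0
    alpha = min(k / tau, 1.0)
    return (1 - alpha) * eps0 + alpha * eps_tau

theta = np.zeros(5)
m = 32                                          # minibatch size
for k in range(1, 2001):
    idx = rng.integers(0, len(X), size=m)       # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2 / m * Xb.T @ (Xb @ theta - yb)     # gradient of mean squared error
    theta -= lr(k) * grad                       # Algorithm 8.1 update

print(np.round(theta, 2))  # close to the true coefficients [1, 2, 3, 4, 5]
```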
8.3.2 Momentum
While stochastic gradient descent remains a very popular optimization strategy, learning with it can sometimes be slow. The method of momentum (Polyak, 1964) is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction. The effect of momentum is illustrated in Fig. 8.5.

Formally, the momentum algorithm introduces a variable v that plays the role of velocity—it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton's laws of motion.
Momentum in physics is mass times velocity. In the momentum learning algorithm, we assume unit mass, so the velocity vector v may also be regarded as the momentum of the particle. A hyperparameter α ∈ [0, 1) determines how quickly the contributions of previous gradients exponentially decay. The update rule is given by:

v ← αv − ε ∇_θ ( (1/m) ∑_{i=1}^m L(f(x^(i); θ), y^(i)) ),    (8.15)
θ ← θ + v.    (8.16)

The velocity v accumulates the gradient elements ∇_θ ( (1/m) ∑_{i=1}^m L(f(x^(i); θ), y^(i)) ). The larger α is relative to ε, the more previous gradients affect the current direction. The SGD algorithm with momentum is given in Algorithm 8.2.
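The update rule above can be sketched in a few lines of NumPy. The poorly conditioned quadratic loss below is an invented stand-in for a training objective, with full gradients in place of minibatch estimates:

```python
# A sketch of the momentum update (Eqs. 8.15-8.16) on an invented,
# poorly conditioned quadratic like the one in Fig. 8.5; full gradients
# stand in for minibatch gradient estimates.
import numpy as np

A = np.diag([1.0, 100.0])                # ill-conditioned Hessian
grad = lambda theta: A @ theta           # gradient of J(theta) = 0.5 theta^T A theta

eps, alpha = 0.01, 0.9                   # learning rate and momentum parameter
theta = np.array([10.0, 1.0])
v = np.zeros(2)
for _ in range(300):
    v = alpha * v - eps * grad(theta)    # Eq. 8.15
    theta = theta + v                    # Eq. 8.16

print(np.linalg.norm(theta) < 1e-2)  # True: settles near the minimum at the origin
```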
Figure 8.5: Momentum aims primarily to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient. Here, we illustrate how momentum overcomes the first of these two problems. The contour lines depict a quadratic loss function with a poorly conditioned Hessian matrix. The red path cutting across the contours indicates the path followed by the momentum learning rule as it minimizes this function. At each step along the way, we draw an arrow indicating the step that gradient descent would take at that point. We can see that a poorly conditioned quadratic objective looks like a long, narrow valley or canyon with steep sides. Momentum correctly traverses the canyon lengthwise, while gradient steps waste time moving back and forth across the narrow axis of the canyon. Compare also Fig. 4.6, which shows the behavior of gradient descent without momentum.
Previously, the size of the step was simply the norm of the gradient multiplied by the learning rate. Now, the size of the step depends on how large and how aligned a sequence of gradients are. The step size is largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient g, then it will accelerate in the direction of −g, until reaching a terminal velocity where the size of each step is

ε‖g‖ / (1 − α).    (8.17)

It is thus helpful to think of the momentum hyperparameter in terms of 1/(1 − α). For example, α = .9 corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm.

Common values of α used in practice include .5, .9, and .99. Like the learning rate, α may also be adapted over time. Typically it begins with a small value and is later raised.
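This terminal velocity is easy to check numerically; the constant gradient below is an invented example:

```python
# Numeric check (invented example) of the terminal step size in Eq. 8.17:
# with a constant gradient g, the velocity approaches eps * ||g|| / (1 - alpha).
import numpy as np

eps, alpha = 0.1, 0.9
g = np.array([1.0, 0.0])                 # constant gradient with ||g|| = 1

v = np.zeros(2)
for _ in range(200):
    v = alpha * v - eps * g              # Eq. 8.15 with a fixed gradient

terminal = eps * np.linalg.norm(g) / (1 - alpha)   # Eq. 8.17: 0.1 / 0.1 = 1.0
print(np.linalg.norm(v), terminal)       # both approximately 1.0
```

Note that a plain gradient step here would have size ε‖g‖ = 0.1, so α = .9 indeed multiplies the maximum speed by 10.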
It is less important to adapt α over time than to shrink ε over time.

Algorithm 8.2 Stochastic gradient descent (SGD) with momentum
Require: Learning rate ε, momentum parameter α.
Require: Initial parameter θ, initial velocity v.
  while stopping criterion not met do
    Sample a minibatch of m examples from the training set {x^(1), . . . , x^(m)} with corresponding targets y^(i).
    Compute gradient estimate: g ← (1/m) ∇_θ ∑_i L(f(x^(i); θ), y^(i))
    Compute velocity update: v ← αv − εg
    Apply update: θ ← θ + v
  end while

We can view the momentum algorithm as simulating a particle subject to continuous-time Newtonian dynamics. The physical analogy can help to build intuition for how the momentum and gradient descent algorithms behave.

The position of the particle at any point in time is given by θ(t). The particle experiences net force f(t). This force causes the particle to accelerate:

f(t) = (∂²/∂t²) θ(t).    (8.18)
Rather than viewing this as a second-order differential equation of the position, we can introduce the variable v(t) representing the velocity of the particle at time t and rewrite the Newtonian dynamics as a first-order differential equation:

v(t) = (∂/∂t) θ(t),    (8.19)
f(t) = (∂/∂t) v(t).    (8.20)

The momentum algorithm then consists of solving the differential equations via numerical simulation. A simple numerical method for solving differential equations is Euler's method, which simply consists of simulating the dynamics defined by the equation by taking small, finite steps in the direction of each gradient.

This explains the basic form of the momentum update, but what specifically are the forces? One force is proportional to the negative gradient of the cost function: −∇_θ J(θ). This force pushes the particle downhill along the cost function surface. The gradient descent algorithm would simply take a single step based on each gradient, but the Newtonian scenario used by the momentum algorithm instead uses this force to alter the velocity of the particle.
We can think of the particle as being like a hockey puck sliding down an icy surface. Whenever it descends a steep part of the surface, it gathers speed and continues sliding in that direction until it begins to go uphill again.

One other force is necessary. If the only force is the gradient of the cost function, then the particle might never come to rest. Imagine a hockey puck sliding down one side of a valley and straight up the other side, oscillating back and forth forever, assuming the ice is perfectly frictionless. To resolve this problem, we add one other force, proportional to −v(t). In physics terminology, this force corresponds to viscous drag, as if the particle must push through a resistant medium such as syrup. This causes the particle to gradually lose energy over time and eventually converge to a local minimum.

Why do we use −v(t) and viscous drag in particular? Part of the reason to
How However, ever, other physical systems ha have ve otherPart kinds based ( t v use ) is mathematical con v enience—an integer p ow er of the velocity is easy −ers of the velocity on other in integer teger pow owers velocity.. For example, a particle tra trav veling through to w ork with. How ever, otherdrag, physical have other to kinds drag of based the − air exp experiences eriences turbulent withsystems force prop proportional ortional the of square the on other in teger p ow ers of the velocity . F or example, a particle tra v eling through velo elocit cit city y, while a particle mo moving ving along the ground exp experiences eriences dry friction, with a the air exp eriences turbulent drag, with force prop ortional to theTurbulent square ofdrag, the force of constant magnitude. We can reject each of these options. vprop elo cit y , while a particle mo ving along the ground exp eriences dry friction, with a proportional ortional to the square of the velo elocit cit city y, becomes very weak when the velocity is force ofItconstant magnitude. We can rejectthe each of these options. Turbulent drag, small. is not pow owerful erful enough to force particle to come to rest. A particle proportional to the square of thethat veloexp cityeriences , becomes very weak when the velocity is with a non-zero initial velocity experiences only the force of turbulen turbulent t drag small. It is not powerful enough to force the particle come tofrom rest.the A starting particle will mo move ve aw away ay from its initial position forever, with thetodistance with a non-zero initial velocity that exp eriences only the force of turbulen t drag. poin ointt growing like O(log t). We must therefore use a low lower er pow ower er of the velocity velocity. will ve aw from initial position with the distance fromisthe If wemo use a pay ow ower er ofits zero, represen representing tingforever, dry friction, then the force to too ostarting strong. 
point growing like O(log t). We must therefore use a lower power of the velocity. If we use a power of zero, representing dry friction, then the force is too strong. When the force due to the gradient of the cost function is small but non-zero, the constant force due to friction can cause the particle to come to rest before reaching a local minimum. Viscous drag avoids both of these problems—it is weak enough
that the gradient can continue to cause motion until a minimum is reached, but strong enough to prevent motion if the gradient does not justify moving.
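The continuous-time picture above can be sketched with Euler's method applied to the first-order system of Eqs. 8.19-8.20. The toy cost and the drag coefficient below are invented for illustration:

```python
# A sketch (invented toy problem) of the continuous-time view: Euler-integrate
# the first-order system of Eqs. 8.19-8.20, with the net force equal to the
# negative cost gradient plus viscous drag proportional to -v(t) (unit mass).
grad_J = lambda theta: 2 * theta         # gradient of the toy cost J(theta) = theta^2
drag = 5.0                               # viscous drag coefficient (illustrative)
dt = 0.01                                # Euler step size

theta, v = 3.0, 0.0
for _ in range(5000):
    f = -grad_J(theta) - drag * v        # downhill pull plus viscous drag
    v += dt * f                          # Euler step for dv/dt = f(t)
    theta += dt * v                      # Euler step for dtheta/dt = v(t)

print(abs(theta) < 1e-3)  # True: viscous drag lets the particle settle at the minimum
```

Without the drag term the particle would oscillate around the minimum forever, which is the frictionless-hockey-puck scenario described above.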
8.3.3 Nesterov Momentum
Sutskever et al. (2013) introduced a variant of the momentum algorithm that was inspired by Nesterov's accelerated gradient method (Nesterov, 1983, 2004). The update rules in this case are given by:

v ← αv − ε ∇_θ [ (1/m) ∑_{i=1}^m L(f(x^(i); θ + αv), y^(i)) ],    (8.21)
θ ← θ + v,    (8.22)

where the parameters α and ε play a similar role as in the standard momentum method. The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum. The complete Nesterov momentum algorithm is presented in Algorithm 8.3.

In the convex batch gradient case, Nesterov momentum brings the rate of convergence of the excess error from O(1/k) (after k steps) to O(1/k²) as shown by Nesterov (1983). Unfortunately, in the stochastic gradient case, Nesterov
Unfortunately Unfortunately, , in Nestero the sto stochastic gradient t case,theNestero Nesterov v con v ergence of the excess error from ) (after steps) to ) as shown O (1 /k k O (1 /k momen momentum tum do does es not improv improvee the rate of conv convergence. ergence. by Nesterov (1983). Unfortunately, in the stochastic gradient case, Nesterov momentum do es Sto notchastic improvgradien e the rate of conv ergence. Algorithm 8.3 Stochastic gradient t descent (SGD) with Nesterov momentum Require: Learning rate , momentum parameter Algorithm 8.3 Stochastic gradient descent (SGD)α.with Nesterov momentum Require: Initial parameter θ, initial velocity v. Require: Learning rate , momentum while stopping criterion not met do parameter α. Require: parameter , initial velocity SampleInitial a minibatch of mθexamples from thev.training set {x(1), . . . , x (m)} with while do stopping criterion not met corresp corresponding onding lab labels els y(i) . Sample a minibatch of mθ˜ examples set x , . . . , x with Apply interim up update: date: ← θ + αvfrom the training P corresponding labels . Compute gradient (atyinterim point): g ← m1 ∇θ˜ i L(f{(x (i); θ˜), y (i)) } ˜ Apply interim update: θ v θ←+ααvv− g Compute velocity up update: date: L(f (x ; θ˜), y ) Compute gradient (at interim Apply up update: date: θ ← θ + v← point): g Compute αv g ← ∇ end while velocity update: v Apply update: θ θ+v ← − end while ← P 300
8.4 Parameter Initialization Strategies
Some optimization algorithms are not iterative by nature and simply solve for a solution point. Other optimization algorithms are iterative by nature but, when applied to the right class of optimization problems, converge to acceptable solutions in an acceptable amount of time regardless of initialization. Deep learning training algorithms usually do not have either of these luxuries. Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations. Moreover, training deep models is a sufficiently difficult task that most algorithms are strongly affected by the choice of initialization. The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether. When learning does converge,
When learning does conv converge, erge, con v erges at all, with some initial p oints being so unstable that the algorithm the initial poin ointt can determine how quickly learning conv converges erges and whether it encoun ters to numerical learning does converge, con conv verges a pointdifficulties with highand or fails low altogether. cost. Also,When points of comparable cost the initial p oin t can determine how quickly learning conv erges and whether it can hav havee wildly varying generalization error, and the initial point can affect the converges to aaspw oint generalization ell. with high or low cost. Also, points of comparable cost can have wildly varying generalization error, and the initial point can affect the Mo Modern dern initialization strategies are simple and heuristic. Designing improv improved ed generalization as well. initialization strategies is a difficult task because neural netw network ork optimization is strategies are simple and heuristic. Designing improv ed not Mo yetdern well initialization understo understoood. Most initialization strategies are based on achieving some initialization strategies a difficult becauseHow neural orknot optimization nice prop properties erties when theis netw network ork is task initialized. However, ever,netw we do hav havee a go goo ois d not y et well understo o d. Most initialization strategies are based on achieving some understanding of whic which h of these prop properties erties are preserv preserved ed under whic which h circumstances nice prop ertiesbwhen netw ork is However, we do notinitial have apoin go ots d after learning egins the to pro proceed. ceed. A initialized. further difficulty is that some oints understanding of whic h ofthe these prop erties preserved under which circumstances ma may y be beneficial from viewp viewpoint oint of are optimization but detrimental from the after learning b egins to pro ceed. 
A further difficulty is that some initial points viewp viewpoin oin ointt of generalization. Our understanding of how the initial poin ointt affects ma y b e b eneficial from the viewp oint of optimization but detrimental from the generalization is esp especially ecially primitiv primitive, e, offering little to no guidance for how to select viewp oint of generalization. Our understanding of how the initial point affects the initial poin oint. t. generalization is especially primitive, offering little to no guidance for how to select Perhaps the only prop property erty known with complete certaint certainty y is that the initial the initial point. parameters need to “break symmetry” b betw etw etween een differen differentt units. If tw two o hidden P erhaps the only prop erty known with complete certaint y is that the units with the same activ activation ation function are connected to the same inputs,initial then parameters need hav to “break symmetry” between differen t units. If same two hidden these units must have e different initial parameters. If they hav havee the initial units with the same activation function connected to the inputs, then parameters, then a deterministic learning are algorithm applied to asame deterministic cost these units must hav e different initial parameters. If they hav e the same initial and mo model del will constan constantly tly up update date both of these units in the same way. Ev Even en if the parameters, then a deterministic learning algorithm applied to a deterministic costt mo model del or training algorithm is capable of using sto stochasticit chasticit chasticity y to compute differen different anddates model constan tly up date both of ifthese the drop sameout), way. 
itEv if the up updates forwill differen different t units (for example, one units trainsinwith dropout), isenusually mo trainingeach algorithm capable ofa using sto to compute differen b estdeltoorinitialize unit tois compute differen different t chasticit functiony from all of the othert updatesThis for differen t units (fore example, oneinput trainspatterns with drop usually units. may help to mak make sure thatif no areout), lost itinisthe null b est to initialize each unit to compute a differen t function from all of the other space of forward propagation and no gradien gradientt patterns are lost in the null space units. This may helpThe to mak sure that eac nohinput patterns aare lost in function the null of back-propagation. goale of having each unit compute different space of forward propagation and no gradient patterns are lost in the null space 301 each unit compute a different function of back-propagation. The goal of having
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
motivates random initialization of the parameters. We could explicitly search for a large set of basis functions that are all mutually different from each other, but this often incurs a noticeable computational cost. For example, if we have at most as many outputs as inputs, we could use Gram-Schmidt orthogonalization on an initial weight matrix, and be guaranteed that each unit computes a very different function from each other unit. Random initialization from a high-entropy distribution over a high-dimensional space is computationally cheaper and unlikely to assign any units to compute the same function as each other.

Typically, we set the biases for each unit to heuristically chosen constants, and initialize only the weights randomly. Extra parameters, for example, parameters encoding the conditional variance of a prediction, are usually set to heuristically chosen constants much like the biases are.

We almost always initialize all the weights in the model to values drawn randomly from a Gaussian or uniform distribution. The choice of Gaussian or uniform distribution does not seem to matter very much, but has not been exhaustively studied. The scale of the initial distribution, however, does have a large effect on both the outcome of the optimization procedure and on the ability of the network to generalize.

Larger initial weights will yield a stronger symmetry breaking effect, helping to avoid redundant units. They also help to avoid losing signal during forward or back-propagation through the linear component of each layer—larger values in the matrix result in larger outputs of matrix multiplication. Initial weights that are too large may, however, result in exploding values during forward propagation or back-propagation. In recurrent networks, large weights can also result in chaos (such extreme sensitivity to small perturbations of the input that the behavior of the deterministic forward propagation procedure appears random). To some extent, the exploding gradient problem can be mitigated by gradient clipping (thresholding the values of the gradients before performing a gradient descent step). Large weights may also result in extreme values that cause the activation function to saturate, causing complete loss of gradient through saturated units. These competing factors determine the ideal initial scale of the weights.

The perspectives of regularization and optimization can give very different insights into how we should initialize a network. The optimization perspective suggests that the weights should be large enough to propagate information successfully, but some regularization concerns encourage making them smaller. The use of an optimization algorithm such as stochastic gradient descent that makes small incremental changes to the weights and tends to halt in areas that are nearer to the initial parameters (whether due to getting stuck in a region of low gradient, or
due to triggering some early stopping criterion based on overfitting) expresses a prior that the final parameters should be close to the initial parameters. Recall from Sec. 7.8 that gradient descent with early stopping is equivalent to weight decay for some models. In the general case, gradient descent with early stopping is not the same as weight decay, but does provide a loose analogy for thinking about the effect of initialization. We can think of initializing the parameters θ to θ0 as being similar to imposing a Gaussian prior p(θ) with mean θ0. From this point of view, it makes sense to choose θ0 to be near 0. This prior says that it is more likely that units do not interact with each other than that they do interact. Units interact only if the likelihood term of the objective function expresses a strong preference for them to interact. On the other hand, if we initialize θ0 to large values, then our prior specifies which units should interact with each other, and how they should interact.

Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully connected layer with m inputs and n outputs by sampling each weight from U(−1/√m, 1/√m), while Glorot and Bengio (2010) suggest using the normalized initialization

W_{i,j} ∼ U(−√(6/(m+n)), √(6/(m+n))).    (8.23)
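As a concrete sketch (not code from the book), both heuristics fit in a few lines of NumPy; the layer sizes below are illustrative:

```python
import numpy as np

def uniform_sqrt_m(m, n, rng):
    """Common heuristic: sample each weight from U(-1/sqrt(m), 1/sqrt(m))."""
    limit = 1.0 / np.sqrt(m)
    return rng.uniform(-limit, limit, size=(m, n))

def glorot_normalized(m, n, rng):
    """Normalized initialization of Eq. 8.23 (Glorot and Bengio, 2010)."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

rng = np.random.default_rng(0)
W = glorot_normalized(784, 256, rng)   # illustrative layer sizes
# Var(U(-a, a)) = a^2 / 3 = 2 / (m + n): the variance that balances
# activation and gradient variance under the linear-network assumption.
print(W.var())
```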
This latter heuristic is designed to compromise between the goal of initializing all layers to have the same activation variance and the goal of initializing all layers to have the same gradient variance. The formula is derived using the assumption that the network consists only of a chain of matrix multiplications, with no nonlinearities. Real neural networks obviously violate this assumption, but many strategies designed for the linear model perform reasonably well on its nonlinear counterparts.

Saxe et al. (2013) recommend initializing to random orthogonal matrices, with a carefully chosen scaling or gain factor g that accounts for the nonlinearity applied at each layer. They derive specific values of the scaling factor for different types of nonlinear activation functions. This initialization scheme is also motivated by a model of a deep network as a sequence of matrix multiplies without nonlinearities. Under such a model, this initialization scheme guarantees that the total number of training iterations required to reach convergence is independent of depth.
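A minimal sketch of this kind of scheme, assuming a QR-based sampler; the gain value sqrt(2) below is an illustrative choice, not one prescribed by the book:

```python
import numpy as np

def orthogonal_init(m, n, gain=1.0, rng=None):
    """Random orthogonal initialization in the spirit of Saxe et al. (2013).
    `gain` plays the role of the scaling factor g."""
    if rng is None:
        rng = np.random.default_rng(0)
    a = rng.standard_normal((max(m, n), min(m, n)))
    q, r = np.linalg.qr(a)            # q has orthonormal columns
    q = q * np.sign(np.diag(r))       # sign fix so q is sampled uniformly
    if m < n:
        q = q.T
    return gain * q[:m, :n]

W = orthogonal_init(4, 4, gain=np.sqrt(2.0))
# Columns remain orthogonal, so W.T @ W equals gain**2 times the identity:
print(np.allclose(W.T @ W, 2.0 * np.eye(4)))
```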
Increasing the scaling factor g pushes the network toward the regime where activations increase in norm as they propagate forward through the network and gradients increase in norm as they propagate backward. Sussillo (2014) showed that setting the gain factor correctly is sufficient to train networks as deep as 1,000 layers, without needing to use orthogonal initializations. A key insight of
this approach is that in feedforward networks, activations and gradients can grow or shrink on each step of forward or back-propagation, following a random walk behavior. This is because feedforward networks use a different weight matrix at each layer. If this random walk is tuned to preserve norms, then feedforward networks can mostly avoid the vanishing and exploding gradients problem that arises when the same weight matrix is used at each step, described in Sec. 8.2.5.

Unfortunately, these optimal criteria for initial weights often do not lead to optimal performance. This may be for three different reasons. First, we may be using the wrong criteria—it may not actually be beneficial to preserve the norm of a signal throughout the entire network. Second, the properties imposed at initialization may not persist after learning has begun to proceed. Third, the criteria might succeed at improving the speed of optimization but inadvertently increase generalization error. In practice, we usually need to treat the scale of the weights as a hyperparameter whose optimal value lies somewhere roughly near but not exactly equal to the theoretical predictions.

One drawback to scaling rules that set all of the initial weights to have the same standard deviation, such as 1/√m, is that every individual weight becomes extremely small when the layers become large. Martens (2010) introduced an alternative initialization scheme called sparse initialization in which each unit is initialized to have exactly k non-zero weights. The idea is to keep the total amount of input to the unit independent from the number of inputs m without making the magnitude of individual weight elements shrink with m. Sparse initialization helps to achieve more diversity among the units at initialization time.
However, it also imposes a very strong prior on the weights that are chosen to have large Gaussian values. Because it takes a long time for gradient descent to shrink "incorrect" large values, this initialization scheme can cause problems for units such as maxout units that have several filters that must be carefully coordinated with each other.

When computational resources allow it, it is usually a good idea to treat the initial scale of the weights for each layer as a hyperparameter, and to choose these scales using a hyperparameter search algorithm described in Sec. 11.4.2, such as random search. The choice of whether to use dense or sparse initialization can also be made a hyperparameter.
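The sparse scheme discussed above can be sketched as follows; the value of k and the unit-variance Gaussian are illustrative choices, not values from the book:

```python
import numpy as np

def sparse_init(m, n, k=15, rng=None):
    """Sparse initialization in the spirit of Martens (2010): each of the
    n units receives exactly k non-zero Gaussian incoming weights,
    regardless of the number of inputs m."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=k, replace=False)  # k distinct inputs
        W[rows, j] = rng.standard_normal(k)
    return W

W = sparse_init(1000, 50, k=15)
# Every column (unit) has exactly k non-zero entries:
print((W != 0).sum(axis=0).min(), (W != 0).sum(axis=0).max())
```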
Alternately, one can manually search for the best initial scales. A good rule of thumb for choosing the initial scales is to look at the range or standard deviation of activations or gradients on a single minibatch of data. If the weights are too small, the range of activations across the minibatch will shrink as the activations propagate forward through the network. By repeatedly identifying the first layer with unacceptably small activations and increasing its weights, it is possible to eventually obtain a network with reasonable
initial activations throughout. If learning is still too slow at this point, it can be useful to look at the range or standard deviation of the gradients as well as the activations. This procedure can in principle be automated and is generally less computationally costly than hyperparameter optimization based on validation set error because it is based on feedback from the behavior of the initial model on a single batch of data, rather than on feedback from a trained model on the validation set. While long used heuristically, this protocol has recently been specified more formally and studied by Mishkin and Matas (2015).

So far we have focused on the initialization of the weights. Fortunately, initialization of other parameters is typically easier.

The approach for setting the biases must be coordinated with the approach for setting the weights. Setting the biases to zero is compatible with most weight initialization schemes.
setting There are few situations we ma may ywith set some biases to for settings the weights. Setting the biases to zero is compatible with most weight non-zero values: initialization schemes. There are a few situations where we may set some biases to non-zero alues: • If a v bias is for an output unit, then it is often beneficial to initialize the bias to obtain the righ rightt marginal statistics of the output. To do this, we assume that If a bias is for antsoutput unit,enough then itthat is often the bias to the initial weigh eights are small the beneficial output of to theinitialize unit is determined obtain the righ t marginal statistics of the output. T o do this, w e assume that • only by the bias. This justifies setting the bias to the inv inverse erse of the activ activation ation the initialapplied weightstoare enough that theofoutput of the in unit determined function thesmall marginal statistics the output theistraining set. only b y the bias. This justifies setting the bias to the inv erse of the activ ation For example, if the output is a distribution ov over er classes and this distribution function applied to the marginal statistics of the probability output in the training set. i giv is a highly sk skew ew ewed ed distribution with the marginal of class given en F or example, if the output is a distribution ov er classes and this distribution by element ci of some vector c, then we can set the bias vector b by solving is a equation highly skew ed distribution with applies the marginal probability of class given ( b) = c . This the softmax softmax( not only to classifiers but ialso to c c b b y element of some vector , then we can set the bias vector b y solving mo models dels we will encounter in Part III, such as auto autoenco enco encoders ders and Boltzmann ( b ) = c the equation softmax . This applies not only to classifiers to mac machines. hines. 
These models have layers whose output should resemble the input data x, and it can be very helpful to initialize the biases of such layers to match the marginal distribution over x.

• Sometimes we may want to choose the bias to avoid causing too much saturation at initialization. For example, we may set the bias of a ReLU hidden unit to 0.1 rather than 0 to avoid saturating the ReLU at initialization. This approach is not compatible with weight initialization schemes that do not expect strong input from the biases though. For example, it is not recommended for use with random walk initialization (Sussillo, 2014).

• Sometimes a unit controls whether other units are able to participate in a function. In such situations, we have a unit with output u and another unit h ∈ [0, 1]; then we can view h as a gate that determines whether uh ≈ u or uh ≈ 0. In these situations, we want to set the bias for h so that h ≈ 1 most
of the time at initialization. Otherwise u does not have a chance to learn. For example, Jozefowicz et al. (2015) advocate setting the bias to 1 for the forget gate of the LSTM model, described in Sec. 10.10.

Another common type of parameter is a variance or precision parameter. For example, we can perform linear regression with a conditional variance estimate using the model

p(y | x) = N(y; w⊤x + b, 1/β)    (8.24)

where β is a precision parameter. We can usually initialize variance or precision parameters to 1 safely. Another approach is to assume the initial weights are close enough to zero that the biases may be set while ignoring the effect of the weights, then set the biases to produce the correct marginal mean of the output, and set the variance parameters to the marginal variance of the output in the training set.
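The two bias recipes above — solving softmax(b) = c for output units, and initializing a precision parameter to 1 — are easy to state in code; the class marginals below are made-up numbers for illustration:

```python
import numpy as np

# Output-unit biases via softmax(b) = c: b = log(c) works because
# softmax is invariant to adding a constant to b.
c = np.array([0.85, 0.10, 0.05])     # illustrative marginal class frequencies
b = np.log(c)
recovered = np.exp(b) / np.exp(b).sum()
print(np.allclose(recovered, c))

# Precision parameter beta of Eq. 8.24: initializing to 1 is usually safe.
beta = 1.0
```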
Besides these simple constant or random methods of initializing model parameters, it is possible to initialize model parameters using machine learning. A common strategy discussed in Part III of this book is to initialize a supervised model with the parameters learned by an unsupervised model trained on the same inputs. One can also perform supervised training on a related task. Even performing supervised training on an unrelated task can sometimes yield an initialization that offers faster convergence than a random initialization. Some of these initialization strategies may yield faster convergence and better generalization because they encode information about the distribution in the initial parameters of the model. Others apparently perform well primarily because they set the parameters to have the right scale or set different units to compute different functions from each other.
8.5
Algorithms with Adaptive Learning Rates
Neural network researchers have long realized that the learning rate was reliably one of the hyperparameters that is the most difficult to set because it has a significant impact on model performance. As we have discussed in Sec. 4.3 and Sec. 8.2, the cost is often highly sensitive to some directions in parameter space and insensitive to others. The momentum algorithm can mitigate these issues somewhat, but does so at the expense of introducing another hyperparameter. In the face of this, it is natural to ask if there is another way. If we believe that the directions of sensitivity are somewhat axis-aligned, it can make sense to use a separate learning rate for each parameter, and automatically adapt these learning rates throughout the course of learning.
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
The delta-bar-delta algorithm (Jacobs, 1988) is an early heuristic approach to adapting individual learning rates for model parameters during training. The approach is based on a simple idea: if the partial derivative of the loss, with respect to a given model parameter, remains the same sign, then the learning rate should increase. If the partial derivative with respect to that parameter changes sign, then the learning rate should decrease. Of course, this kind of rule can only be applied to full batch optimization.

More recently, a number of incremental (or mini-batch-based) methods have been introduced that adapt the learning rates of model parameters. This section will briefly review a few of these algorithms.
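A minimal sketch of the sign heuristic follows. The additive and multiplicative constants are illustrative, and this omits the exponentially smoothed gradient average (the "bar-delta") that the full algorithm compares against:

```python
import numpy as np

def delta_bar_delta_step(lr, grad, prev_grad, kappa=0.01, phi=0.5):
    """Adapt per-parameter learning rates from consecutive full-batch gradients.

    Where a partial derivative keeps its sign, increase that learning rate
    additively; where the sign flips, decrease it multiplicatively.
    """
    same_sign = np.sign(grad) == np.sign(prev_grad)
    return np.where(same_sign, lr + kappa, lr * phi)

# First coordinate keeps its sign (rate grows), second flips (rate shrinks).
lr = delta_bar_delta_step(np.full(2, 0.1),
                          np.array([1.0, -1.0]),
                          np.array([0.5, 1.0]))
```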
8.5.1
AdaGrad
The AdaGrad algorithm, shown in Algorithm 8.4, individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of their historical squared values (Duchi et al., 2011). The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate. The net effect is greater progress in the more gently sloped directions of parameter space.

In the context of convex optimization, the AdaGrad algorithm enjoys some desirable theoretical properties. However, empirically it has been found that, for training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate. AdaGrad performs well for some but not all deep learning models.

8.5.2
RMSProp
The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad is designed to converge rapidly when applied to a convex function. When applied to a non-convex function to train a neural network, the learning trajectory may pass through many different structures and eventually arrive at a region that is a locally convex bowl. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at such a convex structure. RMSProp uses an exponentially decaying average to discard history from the
Algorithm 8.4 The AdaGrad algorithm
Require: Global learning rate ε
Require: Initial parameter θ
Require: Small constant δ, perhaps 10^-7, for numerical stability
Initialize gradient accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← r + g ⊙ g
  Compute update: Δθ ← −(ε / (δ + √r)) ⊙ g. (Division and square root applied element-wise)
  Apply update: θ ← θ + Δθ
end while

extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.

RMSProp is shown in its standard form in Algorithm 8.5 and combined with Nesterov momentum in Algorithm 8.6. Compared to AdaGrad, the use of the moving average introduces a new hyperparameter, ρ, that controls the length scale of the moving average.

Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks. It is currently one of the go-to optimization methods being employed routinely by deep learning practitioners.
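The contrast between the two accumulation rules can be sketched in a few lines of numpy. These single-step updates follow Algorithms 8.4 and 8.5; the quadratic objective and all constants are merely illustrative:

```python
import numpy as np

def adagrad_step(theta, r, grad, eps=0.01, delta=1e-7):
    """Algorithm 8.4: accumulate the full sum of squared gradients."""
    r = r + grad * grad
    theta = theta - eps * grad / (delta + np.sqrt(r))
    return theta, r

def rmsprop_step(theta, r, grad, eps=0.01, rho=0.9, delta=1e-6):
    """Algorithm 8.5: a decaying average discards gradients from the extreme past."""
    r = rho * r + (1.0 - rho) * grad * grad
    theta = theta - eps * grad / np.sqrt(delta + r)
    return theta, r

# Illustrative quadratic bowl with one steep and one shallow direction.
A = np.diag([10.0, 1.0])
theta_a, r_a = np.array([1.0, 1.0]), np.zeros(2)
theta_r, r_r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta_a, r_a = adagrad_step(theta_a, r_a, A @ theta_a)
    theta_r, r_r = rmsprop_step(theta_r, r_r, A @ theta_r)
```

Because AdaGrad's r only grows, its effective step sizes shrink monotonically, while RMSProp's decaying average lets the step size recover once gradients become small.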
8.5.3
Adam
Adam (Kingma and Ba, 2014) is yet another adaptive learning rate optimization algorithm and is presented in Algorithm 8.7. The name “Adam” derives from the phrase “adaptive moments.” In the context of the earlier algorithms, it is perhaps best seen as a variant on the combination of RMSProp and momentum with a few important distinctions. First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp is to apply momentum to the rescaled gradients. The use of momentum in combination with rescaling does not have a clear theoretical motivation. Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization
Algorithm 8.5 The RMSProp algorithm
Require: Global learning rate ε, decay rate ρ.
Require: Initial parameter θ
Require: Small constant δ, usually 10^-6, used to stabilize division by small numbers.
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Accumulate squared gradient: r ← ρr + (1 − ρ) g ⊙ g
  Compute parameter update: Δθ = −(ε / √(δ + r)) ⊙ g. (1/√(δ + r) applied element-wise)
  Apply update: θ ← θ + Δθ
end while

at the origin (see Algorithm 8.7). RMSProp also incorporates an estimate of the (uncentered) second-order moment; however, it lacks the correction factor. Thus, unlike in Adam, the RMSProp second-order moment estimate may have high bias early in training. Adam is generally regarded as being fairly robust to the choice of hyperparameters, though the learning rate sometimes needs to be changed from the suggested default.
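A sketch of one Adam update following Algorithm 8.7, showing how the bias corrections undo the shrinkage toward zero caused by initializing s and r at the origin:

```python
import numpy as np

def adam_step(theta, s, r, t, grad, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update following Algorithm 8.7."""
    t += 1
    s = rho1 * s + (1.0 - rho1) * grad           # biased first moment (momentum)
    r = rho2 * r + (1.0 - rho2) * grad * grad    # biased second moment (uncentered)
    s_hat = s / (1.0 - rho1 ** t)                # corrections for the zero
    r_hat = r / (1.0 - rho2 ** t)                # initialization of s and r
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t

# On the very first step the correction makes s_hat equal the raw gradient,
# so the update has the intended scale instead of being damped toward zero.
theta, s, r, t = adam_step(np.zeros(2), np.zeros(2), np.zeros(2), 0,
                           np.array([1.0, -2.0]))
```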
8.5.4
Choosing the Right Optimization Algorithm
In this section, we discussed a series of related algorithms that each seek to address the challenge of optimizing deep models by adapting the learning rate for each model parameter. At this point, a natural question is: which algorithm should one choose?

Unfortunately, there is currently no consensus on this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks. While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged.

Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam. The choice of which algorithm to use, at this point, seems to depend largely on the user’s familiarity with the algorithm (for ease of hyperparameter tuning).
Algorithm 8.6 RMSProp algorithm with Nesterov momentum
Require: Global learning rate ε, decay rate ρ, momentum coefficient α.
Require: Initial parameter θ, initial velocity v.
Initialize accumulation variable r = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
  Compute interim update: θ̃ ← θ + αv
  Compute gradient: g ← (1/m) ∇_θ̃ Σ_i L(f(x^(i); θ̃), y^(i))
  Accumulate gradient: r ← ρr + (1 − ρ) g ⊙ g
  Compute velocity update: v ← αv − (ε / √r) ⊙ g. (1/√r applied element-wise)
  Apply update: θ ← θ + v
end while
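A numpy sketch of a single step of Algorithm 8.6, with the gradient evaluated at the look-ahead point θ + αv. A tiny constant is added inside the square root as a practical safeguard that the pseudocode above omits; the one-dimensional quadratic is illustrative:

```python
import numpy as np

def rmsprop_nesterov_step(theta, v, r, grad_fn, eps=0.001, rho=0.9, alpha=0.9):
    """One step of RMSProp with Nesterov momentum (Algorithm 8.6)."""
    g = grad_fn(theta + alpha * v)                # gradient at the interim update
    r = rho * r + (1.0 - rho) * g * g             # decaying squared-gradient average
    v = alpha * v - eps * g / np.sqrt(r + 1e-10)  # element-wise velocity update
    return theta + v, v, r

# Illustrative use on J(theta) = theta**2, whose gradient is 2*theta.
theta, v, r = np.array([1.0]), np.zeros(1), np.zeros(1)
for _ in range(300):
    theta, v, r = rmsprop_nesterov_step(theta, v, r, lambda th: 2.0 * th)
```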
8.6
Approximate Second-Order Methods
In this section we discuss the application of second-order methods to the training of deep networks. See LeCun et al. (1998a) for an earlier treatment of this subject. For simplicity of exposition, the only objective function we examine is the empirical risk:

J(θ) = E_{x,y∼p̂_data(x,y)} [L(f(x; θ), y)] = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)).   (8.25)

However, the methods we discuss here extend readily to more general objective functions that, for instance, include parameter regularization terms such as those discussed in Chapter 7.
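Eq. 8.25 is simply an average of per-example losses. A minimal sketch, where the linear model and squared-error loss are illustrative stand-ins for f and L:

```python
import numpy as np

def empirical_risk(theta, X, Y, f, loss):
    """Eq. 8.25: average the per-example loss over the m training examples."""
    return np.mean([loss(f(x, theta), y) for x, y in zip(X, Y)])

f = lambda x, theta: x @ theta                 # illustrative model
loss = lambda yhat, y: 0.5 * (yhat - y) ** 2   # illustrative loss

X = np.array([[1.0, 0.0], [0.0, 1.0]])         # two toy examples
Y = np.array([1.0, 2.0])
risk = empirical_risk(np.zeros(2), X, Y, f, loss)
```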
8.6.1
Newton’s Method
In Sec. 4.3, we introduced second-order gradient methods. In contrast to first-order methods, second-order methods make use of second derivatives to improve optimization. The most widely used second-order method is Newton’s method. We now describe Newton’s method in more detail, with emphasis on its application to neural network training.

Newton’s method is an optimization scheme based on using a second-order Taylor series expansion to approximate J(θ) near some point θ0, ignoring derivatives
Algorithm 8.7 The Adam algorithm
Require: Step size ε (Suggested default: 0.001)
Require: Exponential decay rates for moment estimates, ρ1 and ρ2 in [0, 1). (Suggested defaults: 0.9 and 0.999 respectively)
Require: Small constant δ used for numerical stabilization. (Suggested default: 10^-8)
Require: Initial parameters θ
Initialize 1st and 2nd moment variables s = 0, r = 0
Initialize time step t = 0
while stopping criterion not met do
  Sample a minibatch of m examples from the training set {x^(1), ..., x^(m)} with corresponding targets y^(i).
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  t ← t + 1
  Update biased first moment estimate: s ← ρ1 s + (1 − ρ1) g
  Update biased second moment estimate: r ← ρ2 r + (1 − ρ2) g ⊙ g
  Correct bias in first moment: ŝ ← s / (1 − ρ1^t)
  Correct bias in second moment: r̂ ← r / (1 − ρ2^t)
  Compute update: Δθ = −ε ŝ / (√r̂ + δ) (operations applied element-wise)
  Apply update: θ ← θ + Δθ
end while

of higher order:

J(θ) ≈ J(θ0) + (θ − θ0)^⊤ ∇_θ J(θ0) + (1/2)(θ − θ0)^⊤ H (θ − θ0),   (8.26)

where H is the Hessian of J with respect to θ evaluated at θ0. If we then solve for the critical point of this function, we obtain the Newton parameter update rule:

θ* = θ0 − H^{-1} ∇_θ J(θ0)   (8.27)

Thus for a locally quadratic function (with positive definite H), by rescaling the gradient by H^{-1}, Newton’s method jumps directly to the minimum. If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm associated with Newton’s method, given in Algorithm 8.8.

For surfaces that are not quadratic, as long as the Hessian remains positive definite, Newton’s method can be applied iteratively. This implies a two-step
Algorithm 8.8 Newton’s method with objective J(θ) = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)).
Require: Initial parameter θ0
Require: Training set of m examples
while stopping criterion not met do
  Compute gradient: g ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
  Compute Hessian: H ← (1/m) ∇²_θ Σ_i L(f(x^(i); θ), y^(i))
  Compute Hessian inverse: H^{-1}
  Compute update: Δθ = −H^{-1} g
  Apply update: θ = θ + Δθ
end while

iterative procedure. First, update or compute the inverse Hessian (i.e. by updating the quadratic approximation). Second, update the parameters according to Eq. 8.27.

In Sec. 8.2.3, we discussed how Newton’s method is appropriate only when the Hessian is positive definite. In deep learning, the surface of the objective function is typically non-convex with many features, such as saddle points, that are problematic for Newton’s method.
If the eigenvalues of the Hessian are not all positive, for example, near a saddle point, then Newton’s method can actually cause updates to move in the wrong direction. This situation can be avoided by regularizing the Hessian. Common regularization strategies include adding a constant, α, along the diagonal of the Hessian. The regularized update becomes

θ* = θ0 − [H(f(θ0)) + αI]^{-1} ∇_θ f(θ0).   (8.28)

This regularization strategy is used in approximations to Newton’s method, such as the Levenberg–Marquardt algorithm (Levenberg, 1944; Marquardt, 1963), and works fairly well as long as the negative eigenvalues of the Hessian are still relatively close to zero. In cases where there are more extreme directions of curvature, the value of α would have to be sufficiently large to offset the negative eigenvalues. However, as α increases in size, the Hessian becomes dominated by the αI diagonal and the direction chosen by Newton’s method converges to the standard gradient divided by α. When strong negative curvature is present, α may need to be so large that Newton’s method would make smaller steps than gradient descent with a properly chosen learning rate.

Beyond the challenges created by certain features of the objective function, such as saddle points, the application of Newton’s method for training large neural networks is limited by the significant computational burden it imposes. The
number of elements in the Hessian is squared in the number of parameters, so with k parameters (and for even very small neural networks the number of parameters k can be in the millions), Newton’s method would require the inversion of a k × k matrix, with computational complexity of O(k³). Also, since the parameters will change with every update, the inverse Hessian has to be computed at every training iteration. As a consequence, only networks with a very small number of parameters can be practically trained via Newton’s method. In the remainder of this section, we will discuss alternatives that attempt to gain some of the advantages of Newton’s method while side-stepping the computational hurdles.
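A sketch of one damped Newton update in the form of Eq. 8.28, using a linear solve rather than an explicit inverse (the same O(k³) cost, but a numerically preferable implementation choice); the quadratic objective is illustrative:

```python
import numpy as np

def newton_step(theta, grad, hessian, alpha=0.0):
    """Damped Newton update (Eq. 8.28): solve (H + alpha*I) d = g
    instead of forming the inverse explicitly."""
    H_reg = hessian + alpha * np.eye(len(theta))
    return theta - np.linalg.solve(H_reg, grad)

# Illustrative quadratic J(theta) = 0.5*theta@A@theta - b@theta with
# positive definite A: a single undamped Newton step lands on the minimizer.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
theta = np.zeros(2)
theta = newton_step(theta, A @ theta - b, A)
```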
8.6.2
Conjugate Gradients
Conjugate gradients is a method to efficiently avoid the calculation of the inverse Hessian by iteratively descending conjugate directions. The inspiration for this approach follows from a careful study of the weakness of the method of steepest descent (see Sec. 4.3 for details), where line searches are applied iteratively in the direction associated with the gradient. Fig. 8.6 illustrates how the method of steepest descent, when applied in a quadratic bowl, progresses in a rather ineffective back-and-forth, zig-zag pattern. This happens because each line search direction, when given by the gradient, is guaranteed to be orthogonal to the previous line search direction.
Since the gradient at this point defines the current search direction, d_t = ∇_θ J(θ) will have no contribution in the direction d_{t−1}. Thus d_t is orthogonal to d_{t−1}. This relationship between d_{t−1} and d_t is illustrated in Fig. 8.6 for multiple iterations of steepest descent. As demonstrated in the figure, the choice of orthogonal directions of descent does not preserve the minimum along the previous search directions. This gives rise to the zig-zag pattern of progress, where by descending to the minimum in the current gradient direction, we must re-minimize the objective in the previous gradient direction. Thus, by following the gradient at the end of each line search we are, in a sense, undoing progress we have already made in the direction of the previous line search. The method of conjugate gradients seeks to address this problem.

In the method of conjugate gradients, we seek to find a search direction that is conjugate to the previous line search direction, i.e.
it will not undo progress made in that direction. At training iteration t, the next search direction d_t takes the form:

    d_t = ∇_θ J(θ) + β_t d_{t−1}   (8.29)
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
Figure 8.6: The method of steepest descent applied to a quadratic cost surface. The method of steepest descent involves jumping to the point of lowest cost along the line defined by the gradient at the initial point on each step. This resolves some of the problems seen with using a fixed learning rate in Fig. 4.6, but even with the optimal step size the algorithm still makes back-and-forth progress toward the optimum. By definition, at the minimum of the objective along a given direction, the gradient at the final point is orthogonal to that direction.
where β_t is a coefficient whose magnitude controls how much of the direction d_{t−1} we should add back to the current search direction. Two directions, d_t and d_{t−1}, are defined as conjugate if

    d_t^⊤ H(J) d_{t−1} = 0,   (8.30)

where H(J) is the Hessian of J.
The straightforward way to impose conjugacy would involve calculation of the eigenvectors of H to choose β_t, which would not satisfy our goal of developing a method that is more computationally viable than Newton's method for large problems. Can we calculate the conjugate directions without resorting to these calculations? Fortunately the answer to that is yes.

Two popular methods for computing the β_t are:

1. Fletcher-Reeves:

    β_t = [∇_θ J(θ_t)^⊤ ∇_θ J(θ_t)] / [∇_θ J(θ_{t−1})^⊤ ∇_θ J(θ_{t−1})]   (8.31)

2. Polak-Ribière:

    β_t = [(∇_θ J(θ_t) − ∇_θ J(θ_{t−1}))^⊤ ∇_θ J(θ_t)] / [∇_θ J(θ_{t−1})^⊤ ∇_θ J(θ_{t−1})]   (8.32)
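Both rules are computed from gradients alone, with no Hessian in sight. A minimal sketch, where the two small vectors are arbitrary stand-ins for ∇_θ J(θ_{t−1}) and ∇_θ J(θ_t):

```python
import numpy as np

def beta_fletcher_reeves(g, g_prev):
    # Eq. 8.31: ratio of squared gradient norms.
    return (g @ g) / (g_prev @ g_prev)

def beta_polak_ribiere(g, g_prev):
    # Eq. 8.32: uses the change in gradient.  On a quadratic cost with
    # exact line searches the two formulas give the same value.
    return ((g - g_prev) @ g) / (g_prev @ g_prev)

g_prev = np.array([1.0, -2.0, 0.5])   # stand-in for the previous gradient
g = np.array([0.5, 1.0, -1.0])        # stand-in for the current gradient
print(beta_fletcher_reeves(g, g_prev))   # 2.25 / 5.25
print(beta_polak_ribiere(g, g_prev))     # 4.25 / 5.25
```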
For a quadratic surface, the conjugate directions ensure that the gradient along the previous direction does not increase in magnitude. We therefore stay at the minimum along the previous directions. As a consequence, in a k-dimensional parameter space, conjugate gradients requires only k line searches to achieve the minimum. The conjugate gradient algorithm is given in Algorithm 8.9.

Algorithm 8.9 Conjugate gradient method
Require: Initial parameters θ_0
Require: Training set of m examples
  Initialize ρ_0 = 0
  Initialize g_0 = 0
  Initialize t = 1
  while stopping criterion not met do
    Initialize the gradient g_t = 0
    Compute gradient: g_t ← (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i))
    Compute β_t = [(g_t − g_{t−1})^⊤ g_t] / [g_{t−1}^⊤ g_{t−1}]   (Polak-Ribière)
    (Nonlinear conjugate gradient: optionally reset β_t to zero, for example if t is a multiple of some constant k, such as k = 5)
    Compute search direction: ρ_t = −g_t + β_t ρ_{t−1}
    Perform line search to find: ε* = argmin_ε (1/m) Σ_{i=1}^m L(f(x^(i); θ_t + ε ρ_t), y^(i))
    (On a truly quadratic cost function, analytically solve for ε* rather than explicitly searching for it)
    Apply update: θ_{t+1} = θ_t + ε* ρ_t
    t ← t + 1
  end while
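To make Algorithm 8.9 concrete, here is a minimal NumPy sketch on a small quadratic cost. The particular cost, the closed-form step size, and the restart constant of 5 are illustrative choices for this example, not part of the algorithm:

```python
import numpy as np

# Illustrative quadratic cost f(theta) = 0.5 theta^T A theta - b^T theta,
# with gradient A theta - b and known minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(theta):
    return A @ theta - b

theta = np.array([4.0, -3.0])
g_prev = np.zeros_like(theta)
rho = np.zeros_like(theta)

for t in range(1, 20):
    g = grad(theta)
    if np.linalg.norm(g) < 1e-10:
        break
    # Polak-Ribiere coefficient, reset to zero on the first step and
    # every 5th step (the optional nonlinear-CG restart).
    beta = 0.0 if t == 1 or t % 5 == 0 else ((g - g_prev) @ g) / (g_prev @ g_prev)
    rho = -g + beta * rho                    # search direction
    # On a quadratic cost the optimal step size is available in closed
    # form, standing in for the explicit line search.
    eps = -(g @ rho) / (rho @ A @ rho)
    theta = theta + eps * rho
    g_prev = g

print(theta)  # approaches the minimizer A^{-1} b
```

With exact steps, conjugate gradients reaches the minimum of this 2-dimensional quadratic in two line searches, matching the k-line-searches claim above.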
Nonlinear Conjugate Gradients: So far we have discussed the method of conjugate gradients as it is applied to quadratic objective functions. Of course, our primary interest in this chapter is to explore optimization methods for training neural networks and other related deep learning models where the corresponding objective function is far from quadratic. Perhaps surprisingly, the method of conjugate gradients is still applicable in this setting, though with some modification. Without any assurance that the objective is quadratic, the conjugate directions are no longer assured to remain at the minimum of the objective for previous directions. As a result, the nonlinear conjugate gradients algorithm includes occasional resets where the method of conjugate gradients is restarted with line search along the unaltered gradient.

Practitioners report reasonable results in applications of the nonlinear conjugate
gradients algorithm to training neural networks, though it is often beneficial to initialize the optimization with a few iterations of stochastic gradient descent before commencing nonlinear conjugate gradients. Also, while the (nonlinear) conjugate gradients algorithm has traditionally been cast as a batch method, minibatch versions have been used successfully for the training of neural networks (Le et al., 2011). Adaptations of conjugate gradients specifically for neural networks have been proposed earlier, such as the scaled conjugate gradients algorithm (Moller, 1993).
8.6.3
BFGS
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm attempts to bring some of the advantages of Newton's method without the computational burden. In that respect, BFGS is similar to CG. However, BFGS takes a more direct approach to the approximation of Newton's update. Recall that Newton's update is given by

    θ* = θ_0 − H^{−1} ∇_θ J(θ_0),   (8.33)

where H is the Hessian of J with respect to θ evaluated at θ_0. The primary computational difficulty in applying Newton's update is the calculation of the inverse Hessian H^{−1}. The approach adopted by quasi-Newton methods (of which the BFGS algorithm is the most prominent) is to approximate the inverse with a matrix M_t that is iteratively refined by low rank updates to become a better approximation of H^{−1}.

From Newton's update, in Eq. 8.33, we can see that the parameters at learning
steps t and t + 1 are related via the secant condition (also known as the quasi-Newton condition):

    θ_{t+1} − θ_t = H_t^{−1} (∇_θ J(θ_{t+1}) − ∇_θ J(θ_t))   (8.34)

Eq. 8.34 holds precisely in the quadratic case, or approximately otherwise. The approximation to the Hessian inverse used in the BFGS procedure is constructed so as to satisfy this condition, with M in place of H^{−1}. Specifically, M_t is updated according to:

    M_t = M_{t−1} + (1 + (φ^⊤ M_{t−1} φ)/(∆^⊤ φ)) (∆∆^⊤)/(∆^⊤ φ) − (∆φ^⊤ M_{t−1} + M_{t−1} φ∆^⊤)/(∆^⊤ φ),   (8.35)

where g_t = ∇_θ J(θ_t), φ = g_t − g_{t−1}, and ∆ = θ_t − θ_{t−1}. Eq. 8.35 shows that the BFGS procedure iteratively refines the approximation of the inverse of the Hessian with updates of rank one. This means that if θ ∈ R^n, then the computational
complexity of the update is O(n²). The derivation of the BFGS approximation is given in many textbooks on optimization, including Luenberger (1984).

Once the inverse Hessian approximation M_t is updated, the direction of descent ρ_t is determined by ρ_t = −M_t g_t. A line search is performed in this direction to determine the size of the step, ε*, taken in this direction. The final update to the parameters is given by:

    θ_{t+1} = θ_t + ε* ρ_t.   (8.36)

The complete BFGS algorithm is presented in Algorithm 8.10.
Algorithm 8.10 BFGS method
Require: Initial parameters θ_0
  Initialize inverse Hessian M_0 = I
  while stopping criterion not met do
    Compute gradient: g_t = ∇_θ J(θ_t)
    Compute φ = g_t − g_{t−1}, ∆ = θ_t − θ_{t−1}
    Approx H^{−1}: M_t = M_{t−1} + (1 + (φ^⊤ M_{t−1} φ)/(∆^⊤ φ)) (∆∆^⊤)/(∆^⊤ φ) − (∆φ^⊤ M_{t−1} + M_{t−1} φ∆^⊤)/(∆^⊤ φ)
    Compute search direction: ρ_t = −M_t g_t
    Perform line search to find: ε* = argmin_ε J(θ_t + ε ρ_t)
    Apply update: θ_{t+1} = θ_t + ε* ρ_t
  end while

Like the method of conjugate gradients, the BFGS algorithm iterates a series of line searches with the direction incorporating second-order information. However, unlike conjugate gradients, the success of the approach is not heavily dependent on the line search finding a point very close to the true minimum along the line. Thus, relative to conjugate gradients, BFGS has the advantage that it can spend less time refining each line search.
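Algorithm 8.10 can be sketched in NumPy as follows. The quadratic test cost and the simple backtracking line search are illustrative assumptions made for this example; the update of M follows Eq. 8.35:

```python
import numpy as np

# Illustrative quadratic cost J(theta) = 0.5 theta^T A theta - b^T theta,
# chosen so the exact minimizer A^{-1} b is known.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 0.0, 1.0])

def J(theta):
    return 0.5 * theta @ A @ theta - b @ theta

def grad(theta):
    return A @ theta - b

theta = np.array([2.0, 2.0, 2.0])
M = np.eye(3)                        # inverse Hessian approximation M_0 = I
g = grad(theta)

for _ in range(50):
    if np.linalg.norm(g) < 1e-8:
        break
    rho = -M @ g                     # search direction
    # Backtracking line search: a crude stand-in for the argmin over the
    # step size in Algorithm 8.10.
    eps = 1.0
    while J(theta + eps * rho) > J(theta) + 1e-4 * eps * (g @ rho):
        eps *= 0.5
    theta_new = theta + eps * rho
    g_new = grad(theta_new)
    phi, delta = g_new - g, theta_new - theta
    s = delta @ phi                  # the scalar Delta^T phi of Eq. 8.35
    # Rank-one refinement of the inverse-Hessian approximation (Eq. 8.35);
    # M stays symmetric, so phi^T M equals (M phi)^T.
    M = (M
         + (1 + phi @ M @ phi / s) * np.outer(delta, delta) / s
         - (np.outer(delta, M @ phi) + np.outer(M @ phi, delta)) / s)
    theta, g = theta_new, g_new

print(theta)  # approaches the exact minimizer of J
```

A useful property to check is that the updated M satisfies the secant condition M_t φ = ∆ by construction, which is exactly what Eq. 8.35 is built to guarantee.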
On the other hand, the BFGS algorithm must store the inverse Hessian matrix, M, which requires O(n²) memory, making BFGS impractical for most modern deep learning models that typically have millions of parameters.

Limited Memory BFGS (or L-BFGS) The memory costs of the BFGS algorithm can be significantly decreased by avoiding storing the complete inverse Hessian approximation M. Alternatively, by replacing the M_{t−1} in Eq. 8.35 with an identity matrix, the BFGS search direction update formula becomes:

    ρ_t = −g_t + a∆ + bφ,   (8.37)
where the scalars a and b are given by:

    a = −(1 + (φ^⊤ φ)/(∆^⊤ φ)) (∆^⊤ g_t)/(∆^⊤ φ) + (φ^⊤ g_t)/(∆^⊤ φ)   (8.38)

    b = (∆^⊤ g_t)/(∆^⊤ φ)   (8.39)

with φ and ∆ as defined above. If used with exact line searches, the directions defined by Eq. 8.37 are mutually conjugate. However, unlike the method of conjugate gradients, this procedure remains well behaved when the minimum of the line search is reached only approximately. This strategy can be generalized to include more information about the Hessian by storing previous values of φ and ∆.
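As a consistency check on this memory-free form, the direction −g_t + a∆ + bφ can be compared against −M_t g_t with M_t built from Eq. 8.35 using M_{t−1} = I; the two agree exactly. The vectors below are arbitrary test data, not taken from any particular optimization run:

```python
import numpy as np

g = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0])      # current gradient g_t
phi = np.array([0.5, 1.0, -1.0, 2.0, 0.0, -0.5])    # phi = g_t - g_{t-1}
delta = np.array([1.0, 0.0, 2.0, -1.0, 1.0, 0.5])   # delta = theta_t - theta_{t-1}
s = delta @ phi                                      # the scalar Delta^T phi

# Scalars of Eqs. 8.38-8.39.
a = -(1 + phi @ phi / s) * (delta @ g) / s + (phi @ g) / s
b = (delta @ g) / s

# Direction computed without storing any matrix (Eq. 8.37).
rho_memoryless = -g + a * delta + b * phi

# Same direction via Eq. 8.35 with M_{t-1} = I, at the cost of an
# explicit n x n matrix.
n = len(g)
M = (np.eye(n)
     + (1 + phi @ phi / s) * np.outer(delta, delta) / s
     - (np.outer(delta, phi) + np.outer(phi, delta)) / s)
rho_matrix = -M @ g

print(np.max(np.abs(rho_memoryless - rho_matrix)))  # ~0
```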
8.7
Optimization Strategies and Meta-Algorithms
Many optimization techniques are not exactly algorithms, but rather general templates that can be specialized to yield algorithms, or subroutines that can be incorporated into many different algorithms.
8.7.1
Batch Normalization
Batch normalization (Ioffe and Szegedy, 2015) is one of the most exciting recent innovations in optimizing deep neural networks and it is actually not an optimization algorithm at all. Instead, it is a method of adaptive reparametrization, motivated by the difficulty of training very deep models.

Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously. When we make the update, unexpected results can happen because many functions composed together are changed simultaneously, using updates that were computed under the assumption that the other functions remain constant.
As a simple example, suppose we have a deep neural network that has only one unit per layer and does not use an activation function at each hidden layer: ŷ = x w_1 w_2 w_3 … w_l. Here, w_i provides the weight used by layer i. The output of layer i is h_i = h_{i−1} w_i. The output ŷ is a linear function of the input x, but a nonlinear function of the weights w_i. Suppose our cost function has put a gradient of 1 on ŷ, so we wish to decrease ŷ slightly. The back-propagation algorithm can then compute a gradient g = ∇_w ŷ. Consider what happens when we make an update w ← w − εg. The
first-order Taylor series approximation of ŷ predicts that the value of ŷ will decrease by εg^⊤g. If we wanted to decrease ŷ by .1, this first-order information available in the gradient suggests we could set the learning rate ε to .1/(g^⊤g). However, the actual update will include second-order and third-order effects, on up to effects of order l. The new value of ŷ is given by

    x(w_1 − εg_1)(w_2 − εg_2) … (w_l − εg_l).   (8.40)

An example of one second-order term arising from this update is ε²g_1 g_2 ∏_{i=3}^l w_i. This term might be negligible if ∏_{i=3}^l w_i is small, or might be exponentially large if the weights on layers 3 through l are greater than 1. This makes it very hard to choose an appropriate learning rate, because the effects of an update to the parameters for one layer depend so strongly on all of the other layers.
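This interaction can be observed numerically in a small version of the linear example. The depth l = 10, the weight value 1.1, and the two learning rates below are arbitrary illustrative choices: for the small learning rate the first-order prediction εg^⊤g is accurate, while for the larger one it badly overestimates the decrease in ŷ.

```python
import numpy as np

l = 10
x = 1.0
w = np.full(l, 1.1)               # weights of the deep linear "network"

def y_hat(w):
    return x * np.prod(w)

# Gradient of y_hat with respect to each weight w_i: the product of all
# the other factors, i.e. y_hat / w_i here since every w_i is nonzero.
g = np.array([y_hat(w) / w[i] for i in range(l)])

for eps in (1e-4, 0.05):
    predicted_decrease = eps * g @ g                  # first-order Taylor prediction
    actual_decrease = y_hat(w) - y_hat(w - eps * g)   # true effect of the joint update
    print(eps, predicted_decrease, actual_decrease)
```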
Second-order optimization algorithms address this issue by computing an update that takes these second-order interactions into account, but we can see that in very deep networks, even higher-order interactions can be significant. Even second-order optimization algorithms are expensive and usually require numerous approximations that prevent them from truly accounting for all significant second-order interactions. Building an n-th order optimization algorithm for n > 2 thus seems hopeless. What can we do instead?

Batch normalization provides an elegant way of reparametrizing almost any deep network. The reparametrization significantly reduces the problem of coordinating updates across many layers.
Batch normalization can be applied to any input or hidden layer in a network. Let H be a minibatch of activations of the layer to normalize, arranged as a design matrix, with the activations for each example appearing in a row of the matrix. To normalize H, we replace it with

    H′ = (H − µ)/σ,   (8.41)

where µ is a vector containing the mean of each unit and σ is a vector containing the standard deviation of each unit. The arithmetic here is based on broadcasting the vector µ and the vector σ to be applied to every row of the matrix H. Within each row, the arithmetic is element-wise, so H_{i,j} is normalized by subtracting µ_j and dividing by σ_j. The rest of the network then operates on H′ in exactly the same way that the original network operated on H.

At training time,

    µ = (1/m) Σ_i H_{i,:}   (8.42)
and
    σ = √( δ + (1/m) Σ_i (H − µ)²_i ),   (8.43)

where δ is a small positive value such as 10⁻⁸ imposed to avoid encountering the undefined gradient of √z at z = 0. Crucially, we back-propagate through these operations for computing the mean and the standard deviation, and for applying them to normalize H. This means that the gradient will never propose an operation that acts simply to increase the standard deviation or mean of h_i; the normalization operations remove the effect of such an action and zero out its component in the gradient. This was a major innovation of the batch normalization approach. Previous approaches had involved adding penalties to the cost function to encourage units to have normalized activation statistics or involved intervening to renormalize unit statistics after each gradient descent step.
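The train-time computation of Eqs. 8.41 to 8.43 takes only a few lines of NumPy; the minibatch below is random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((64, 5)) * 3.0 + 7.0   # minibatch: 64 examples, 5 units

delta = 1e-8                                   # small positive constant of Eq. 8.43
mu = H.mean(axis=0)                            # per-unit mean (Eq. 8.42)
sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))  # per-unit std (Eq. 8.43)
H_prime = (H - mu) / sigma                     # broadcast over rows (Eq. 8.41)

print(H_prime.mean(axis=0))  # ~0 for every unit
print(H_prime.std(axis=0))   # ~1 for every unit
```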
The former approach usually resulted in imperfect normalization and the latter usually resulted in significant wasted time as the learning algorithm repeatedly proposed changing the mean and variance and the normalization step repeatedly undid this change. Batch normalization reparametrizes the model to make some units always be standardized by definition, deftly sidestepping both problems.

At test time, µ and σ may be replaced by running averages that were collected during training time. This allows the model to be evaluated on a single example, without needing to use definitions of µ and σ that depend on an entire minibatch.

Revisiting the ŷ = x w_1 w_2 … w_l example, we see that we can mostly resolve the difficulties in learning this model by normalizing h_{l−1}. Suppose that x is drawn from a unit Gaussian. Then h_{l−1} will also come from a Gaussian, because the transformation from x to h_{l−1} is linear. However, h_{l−1} will no longer have zero mean and unit variance.
However, h will noerties. longer F hav zero mean ˆ h properties. or ealmost any l−1 that restores the zero mean and unit variance prop and unit v ariance. After applying batc h normalization, we obtain the normalized ˆ up update date to the lo low wer lay layers, ers, hl−1 will remain a unit Gaussian. The output yˆ ma may y ˆ h that restores zero linear mean function and unityˆv= ariance erties. in For almost any ˆ l−1prop wl h then b e learned as the a simple . Learning this mo is model del ˆ yˆ ma up date the low er layers, will remain a unit The do output y no now w verytosimple because thehparameters at the lo lower werGaussian. la layers yers simply not hav have e an ˆ = w h . Learning then bin e learned as a simple linear is function in this model In is effect most cases; their output alw alwa aysyˆrenormalized to a unit Gaussian. no w very simple b ecause the parameters at the lo wer la yers simply do not hav e an some corner cases, the low lower er lay layers ers can hav havee an effect. Changing one of the lo low wer effect in most cases; their output is alw a ys renormalized to a unit Gaussian. In la lay yer weigh weights ts to 0 can mak makee the output become degenerate, and changing the sign some lowtsercan layers e an effect.betw Changing of ythe lower h l−1one of onecorner of thecases, low lower erthe weigh weights flipcan thehav relationship etween een ˆ and . These layer weighare ts to 0 can mak e the output become degenerate, andup changing the ha sign situations very rare. Without normalization, nearly every update date would hav ve ˆ h y of one of the low er weigh ts can flip the relationship b etw een and . These an extreme effect on the statistics of hl−1. Batc Batch h normalization has thus made situations are v ery rare. Without normalization, nearly every date havofe this mo model del significan significantly tly easier to learn. In this example, the up ease of would learning an extreme theofstatistics of hlo . 
Batc h normalization thus made course came effect at theon cost making the low wer lay layers ers useless. In our has linear example, this model significantly easier to learn. In this example, the ease of learning of course came at the cost of making the lo wer layers useless. In our linear example, 320 σ=
δ+
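As a concrete sketch of the normalization above (Eq. 8.43), the following code standardizes a minibatch of activations. The function name, the minibatch shape, and the use of numpy are illustrative choices of ours, not taken from any particular library.

```python
import numpy as np

def batch_normalize(H, delta=1e-8):
    """Standardize each unit of a minibatch H (rows are examples).

    delta plays the role of the small constant in Eq. 8.43 that keeps
    the square root differentiable when the variance is zero.
    """
    mu = H.mean(axis=0)
    sigma = np.sqrt(delta + ((H - mu) ** 2).mean(axis=0))
    return (H - mu) / sigma

rng = np.random.default_rng(0)
H = 5.0 + 3.0 * rng.standard_normal((64, 10))  # activations with arbitrary statistics
H_hat = batch_normalize(H)
# After normalization, each unit has approximately zero mean and unit
# standard deviation across the minibatch.
```

In a full implementation, µ and σ would also be accumulated into running averages during training for use at test time, as described above.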
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
In our linear example, the lower layers no longer have any harmful effect, but they also no longer have any beneficial effect. This is because we have normalized out the first and second order statistics, which is all that a linear network can influence. In a deep neural network with nonlinear activation functions, the lower layers can perform nonlinear transformations of the data, so they remain useful. Batch normalization acts to standardize only the mean and variance of each unit in order to stabilize learning, but allows the relationships between units and the nonlinear statistics of a single unit to change.

Because the final layer of the network is able to learn a linear transformation, we may actually wish to remove all linear relationships between units within a layer. Indeed, this is the approach taken by Desjardins et al. (2015), who provided the inspiration for batch normalization. Unfortunately, eliminating all linear interactions is much more expensive than standardizing the mean and standard deviation of each individual unit, and so far batch normalization remains the most practical approach.

Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. In order to maintain the expressive power of the network, it is common to replace the batch of hidden unit activations H with γH′ + β rather than simply the normalized H′. The variables γ and β are learned parameters that allow the new variable to have any mean and standard deviation. At first glance, this may seem useless; why did we set the mean to 0, and then introduce a parameter that allows it to be set back to any arbitrary value β? The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the new parametrization has different learning dynamics. In the old parametrization, the mean of H was determined by a complicated interaction between the parameters in the layers below H. In the new parametrization, the mean of γH′ + β is determined solely by β. The new parametrization is much easier to learn with gradient descent.

Most neural network layers take the form φ(XW + b), where φ is some fixed nonlinear activation function such as the rectified linear transformation. It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW + b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW + b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparametrization.
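A small numerical illustration of the γ, β reparametrization described above; the variable names and the particular values of γ and β are chosen arbitrarily for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((256, 4)) * 7.0 - 2.0   # activations with arbitrary statistics

# Standardize, as in batch normalization.
mu = H.mean(axis=0)
sigma = np.sqrt(1e-8 + ((H - mu) ** 2).mean(axis=0))
H_norm = (H - mu) / sigma

# Reintroduce a learnable mean and scale.
gamma, beta = 2.5, -1.0
H_new = gamma * H_norm + beta

# The mean of gamma * H_norm + beta is determined solely by beta,
# regardless of the statistics produced by the layers below.
print(np.allclose(H_new.mean(axis=0), beta, atol=1e-6))  # prints True
```

This also makes concrete why a bias term b would be redundant: any desired shift of the normalized activations is already supplied by β.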
The bias term should b e because it becomes redundant to a lay layer er is usually the output of omitted a nonlinear activ activation ation function such aswith the the β parameter applied by batch normalization reparametrization. rectified linear function in athe previous lay layer. er. The statistics of the inputThe areinput thus to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus 321
more non-Gaussian and less amenable to standardization by linear operations.

In convolutional networks, described in Chapter 9, it is important to apply the same normalizing µ and σ at every spatial location within a feature map, so that the statistics of the feature map remain the same regardless of spatial location.
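For instance, assuming feature maps stored as a (batch, height, width, channels) array (the layout is our assumption; frameworks differ), the statistics are pooled over the batch and both spatial axes, giving one µ and one σ per feature map:

```python
import numpy as np

rng = np.random.default_rng(0)
F = 2.0 + 4.0 * rng.standard_normal((8, 5, 5, 3))  # (batch, height, width, channels)

# One mu and sigma per feature map (channel), shared across all
# spatial locations, so normalized statistics do not depend on location.
mu = F.mean(axis=(0, 1, 2))
sigma = np.sqrt(1e-8 + F.var(axis=(0, 1, 2)))
F_hat = (F - mu) / sigma
```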
8.7.2 Coordinate Descent

In some cases, it may be possible to solve an optimization problem quickly by breaking it into separate pieces. If we minimize f(x) with respect to a single variable x_i, then minimize it with respect to another variable x_j and so on, repeatedly cycling through all variables, we are guaranteed to arrive at a (local) minimum. This practice is known as coordinate descent, because we optimize one coordinate at a time. More generally, block coordinate descent refers to minimizing with respect to a subset of the variables simultaneously. The term "coordinate descent" is often used to refer to block coordinate descent as well as the strictly individual coordinate descent.
Coordinate descent makes the most sense when the different variables in the optimization problem can be clearly separated into groups that play relatively isolated roles, or when optimization with respect to one group of variables is significantly more efficient than optimization with respect to all of the variables. For example, consider the cost function

J(H, W) = Σ_{i,j} |H_{i,j}| + Σ_{i,j} (X − W⊤H)²_{i,j}.    (8.44)

This function describes a learning problem called sparse coding, where the goal is to find a weight matrix W that can linearly decode a matrix of activation values H to reconstruct the training set X. Most applications of sparse coding also involve weight decay or a constraint on the norms of the columns of W, in order to prevent the pathological solution with extremely small H and large W.

The function J is not convex. However, we can divide the inputs to the training algorithm into two sets: the dictionary parameters W and the code representations H. Minimizing the objective function with respect to either one of these sets of variables is a convex problem. Block coordinate descent thus gives us an optimization strategy that allows us to use efficient convex optimization algorithms, by alternating between optimizing W with H fixed, then optimizing H with W fixed.
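This alternation can be sketched in a few lines. To keep each block update in closed form, the toy version below drops the |H| sparsity penalty of Eq. 8.44 and uses an H W layout for the reconstruction term, so it illustrates the block coordinate descent strategy itself rather than a complete sparse coding solver.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 30))   # data to reconstruct
H = rng.standard_normal((20, 5))    # code representations
W = rng.standard_normal((5, 30))    # dictionary

def reconstruction_loss(H, W):
    return np.sum((X - H @ W) ** 2)

losses = [reconstruction_loss(H, W)]
for _ in range(10):
    # Minimize over W with H fixed: a convex least-squares subproblem.
    W = np.linalg.lstsq(H, X, rcond=None)[0]
    # Minimize over H with W fixed: another convex least-squares subproblem.
    H = np.linalg.lstsq(W.T, X.T, rcond=None)[0].T
    losses.append(reconstruction_loss(H, W))

# Each exact block minimization can only decrease the objective.
```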
Coordinate descent is not a very good strategy when the value of one variable strongly influences the optimal value of another variable, as in the function f(x) =
(x₁ − x₂)² + α(x₁² + x₂²), where α is a positive constant. The first term encourages the two variables to have similar value, while the second term encourages them to be near zero. The solution is to set both to zero. Newton's method can solve the problem in a single step because it is a positive definite quadratic problem. However, for small α, coordinate descent will make very slow progress because the first term does not allow a single variable to be changed to a value that differs significantly from the current value of the other variable.
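The slow progress is easy to see numerically. Setting ∂f/∂x₁ = 0 with x₂ fixed gives x₁ = x₂/(1 + α), and symmetrically for x₂, so each exact coordinate update shrinks the iterate by only a factor of 1/(1 + α):

```python
def f(x1, x2, alpha):
    return (x1 - x2) ** 2 + alpha * (x1 ** 2 + x2 ** 2)

alpha = 1e-3
x1 = x2 = 1.0
for _ in range(100):
    x1 = x2 / (1 + alpha)  # exact minimization over x1 with x2 fixed
    x2 = x1 / (1 + alpha)  # exact minimization over x2 with x1 fixed

# After 100 full cycles, the iterate is still far from the optimum (0, 0),
# while Newton's method would jump there in a single step.
print(x1)  # prints about 0.82
```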
8.7.3 Polyak Averaging

Polyak averaging (Polyak and Juditsky, 1992) consists of averaging together several points in the trajectory through parameter space visited by an optimization algorithm. If t iterations of gradient descent visit points θ^(1), …, θ^(t), then the output of the Polyak averaging algorithm is θ̂^(t) = (1/t) Σ_i θ^(i). On some problem classes, such as gradient descent applied to convex problems, this approach has strong convergence guarantees. When applied to neural networks, its justification is more heuristic, but it performs well in practice. The basic idea is that the optimization algorithm may leap back and forth across a valley several times without ever visiting a point near the bottom of the valley. The average of all of the locations on either side should be close to the bottom of the valley though.

In non-convex problems, the path taken by the optimization trajectory can be very complicated and visit many different regions. Including points in parameter space from the distant past that may be separated from the current point by large barriers in the cost function does not seem like a useful behavior. As a result, when applying Polyak averaging to non-convex problems, it is typical to use an exponentially decaying running average:

θ̂^(t) = α θ̂^(t−1) + (1 − α) θ^(t).    (8.45)

The running average approach is used in numerous applications. See Szegedy et al. (2015) for a recent example.
8.7.4 Supervised Pretraining

Sometimes, directly training a model to solve a specific task can be too ambitious if the model is complex and hard to optimize or if the task is very difficult. It is sometimes more effective to train a simpler model to solve the task, then make the model more complex. It can also be more effective to train the model to solve a simpler task, then move on to confront the final task. These strategies that involve
training simple models on simple tasks before confronting the challenge of training the desired model to perform the desired task are collectively known as pretraining.

Greedy algorithms break a problem into many components, then solve for the optimal version of each component in isolation. Unfortunately, combining the individually optimal components is not guaranteed to yield an optimal complete solution. However, greedy algorithms can be computationally much cheaper than algorithms that solve for the best joint solution, and the quality of a greedy solution is often acceptable if not optimal. Greedy algorithms may also be followed by a fine-tuning stage in which a joint optimization algorithm searches for an optimal solution to the full problem. Initializing the joint optimization algorithm with a greedy solution can greatly speed it up and improve the quality of the solution it finds.

Pretraining, and especially greedy pretraining, algorithms are ubiquitous in deep learning. In this section, we describe specifically those pretraining algorithms that break supervised learning problems into other simpler supervised learning problems. This approach is known as greedy supervised pretraining.

In the original (Bengio et al., 2007) version of greedy supervised pretraining, each stage consists of a supervised learning training task involving only a subset of the layers in the final neural network. An example of greedy supervised pretraining is illustrated in Fig. 8.7, in which each added hidden layer is pretrained as part of a shallow supervised MLP, taking as input the output of the previously trained hidden layer. Instead of pretraining one layer at a time, Simonyan and Zisserman (2015) pretrain a deep convolutional network (eleven weight layers) and then use the first four and last three layers from this network to initialize even deeper networks (with up to nineteen layers of weights). The middle layers of the new, very deep network are initialized randomly. The new network is then jointly trained. Another option, explored by Yu et al. (2010), is to use the outputs of the previously trained MLPs, as well as the raw input, as inputs for each added stage.

Why would greedy supervised pretraining help? The hypothesis initially discussed by Bengio et al. (2007) is that it helps to provide better guidance to the intermediate levels of a deep hierarchy. In general, pretraining may help both in terms of optimization and in terms of generalization.

An approach related to supervised pretraining extends the idea to the context of transfer learning: Yosinski et al. (2014) pretrain a deep convolutional net with 8 layers of weights on a set of tasks (a subset of the 1000 ImageNet object categories) and then initialize a same-size network with the first k layers of the first net. All the layers of the second network (with the upper layers initialized randomly) are
Figure 8.7: Illustration of one form of greedy supervised pretraining (Bengio et al., 2007). (a) We start by training a sufficiently shallow architecture. (b) Another drawing of the same architecture. (c) We keep only the input-to-hidden layer of the original network and discard the hidden-to-output layer. We send the output of the first hidden layer as input to another supervised single hidden layer MLP that is trained with the same objective as the first network was, thus adding a second hidden layer. This can be repeated for as many layers as desired. (d) Another drawing of the result, viewed as a feedforward network. To further improve the optimization, we can jointly fine-tune all the layers, either only at the end or at each stage of this process.
then jointly trained to perform a different set of tasks (another subset of the 1000 ImageNet object categories), with fewer training examples than for the first set of tasks. Other approaches to transfer learning with neural networks are discussed in Sec. 15.2.

Another related line of work is the FitNets (Romero et al., 2015) approach. This approach begins by training a network that has low enough depth and great enough width (number of units per layer) to be easy to train. This network then becomes a teacher for a second network, designated the student. The student network is much deeper and thinner (eleven to nineteen layers) and would be difficult to train with SGD under normal circumstances. The training of the student network is made easier by training the student network not only to predict the output for the original task, but also to predict the value of the middle layer of the teacher network. This extra task provides a set of hints about how the hidden layers should be used and can simplify the optimization problem. Additional parameters are introduced to regress the middle layer of the 5-layer teacher network from the middle layer of the deeper student network. However, instead of predicting the final classification target, the objective is to predict the middle hidden layer of the teacher network. The lower layers of the student networks thus have two objectives: to help the outputs of the student network accomplish their task, as well as to predict the intermediate layer of the teacher network. Although a thin and deep network appears to be more difficult to train than a wide and shallow network, the thin and deep network may generalize better and certainly has a lower computational cost if it is thin enough to have far fewer parameters. Without the hints on the hidden layer, the student network performs very poorly in the experiments, both on the training and test set. Hints on middle layers may thus be one of the tools to help train neural networks that otherwise seem difficult to train, but other optimization techniques or changes in the architecture may also solve the problem.
8.7.5
Designing Models to Aid Optimization
To improve optimization, the best strategy is not always to improve the optimization algorithm. Instead, many improvements in the optimization of deep models have come from designing the models to be easier to optimize.

In principle, we could use activation functions that increase and decrease in jagged non-monotonic patterns. However, this would make optimization extremely difficult. In practice, it is more important to choose a model family that is easy to optimize than to use a powerful optimization algorithm. Most of the advances in neural network learning over the past 30 years have been
CHAPTER 8. OPTIMIZATION FOR TRAINING DEEP MODELS
obtained by changing the model family rather than changing the optimization procedure. Stochastic gradient descent with momentum, which was used to train neural networks in the 1980s, remains in use in modern state of the art neural network applications.

Specifically, modern neural networks reflect a design choice to use linear transformations between layers and activation functions that are differentiable almost everywhere and have significant slope in large portions of their domain. In particular, model innovations like the LSTM, rectified linear units and maxout units have all moved toward using more linear functions than previous models like deep networks based on sigmoidal units. These models have nice properties that make optimization easier. The gradient flows through many layers provided that the Jacobian of the linear transformation has reasonable singular values. Moreover, linear functions consistently increase in a single direction, so even if the model's output is very far from correct, it is clear simply from computing the gradient which direction its output should move to reduce the loss function. In other words, modern neural nets have been designed so that their local gradient information corresponds reasonably well to moving toward a distant solution.

Other model design strategies can help to make optimization easier. For example, linear paths or skip connections between layers reduce the length of the shortest path from the lower layer's parameters to the output, and thus mitigate the vanishing gradient problem (Srivastava et al., 2015). A related idea to skip connections is adding extra copies of the output that are attached to the intermediate hidden layers of the network, as in GoogLeNet (Szegedy et al., 2014a) and deeply-supervised nets (Lee et al., 2014). These "auxiliary heads" are trained to perform the same task as the primary output at the top of the network in order to ensure that the lower layers receive a large gradient. When training is complete the auxiliary heads may be discarded. This is an alternative to the pretraining strategies, which were introduced in the previous section. In this way, one can train jointly all the layers in a single phase but change the architecture, so that intermediate layers (especially the lower ones) can get some hints about what they should do, via a shorter path. These hints provide an error signal to lower layers.
8.7.6
Continuation Methods and Curriculum Learning
As argued in Sec. 8.2.7, many of the challenges in optimization arise from the global structure of the cost function and cannot be resolved merely by making better estimates of local update directions. The predominant strategy for overcoming this problem is to attempt to initialize the parameters in a region that is connected to the solution by a short path through parameter space that local descent can
discover.

Continuation methods are a family of strategies that can make optimization easier by choosing initial points to ensure that local optimization spends most of its time in well-behaved regions of space. The idea behind continuation methods is to construct a series of objective functions over the same parameters. In order to minimize a cost function J(θ), we will construct new cost functions {J^{(0)}, . . . , J^{(n)}}. These cost functions are designed to be increasingly difficult, with J^{(0)} being fairly easy to minimize, and J^{(n)}, the most difficult, being J(θ), the true cost function motivating the entire process. When we say that J^{(i)} is easier than J^{(i+1)}, we mean that it is well behaved over more of θ space. A random initialization is more likely to land in the region where local descent can minimize the cost function successfully because this region is larger. The series of cost functions are designed so that a solution to one is a good initial point of the next. We thus begin by solving an easy problem then refine the solution to solve incrementally harder problems until we arrive at a solution to the true underlying problem.

Traditional continuation methods (predating the use of continuation methods for neural network training) are usually based on smoothing the objective function. See Wu (1997) for an example of such a method and a review of some related methods. Continuation methods are also closely related to simulated annealing, which adds noise to the parameters (Kirkpatrick et al., 1983). Continuation methods have been extremely successful in recent years. See Mobahi and Fisher (2015) for an overview of recent literature, especially for AI applications.

Continuation methods traditionally were mostly designed with the goal of overcoming the challenge of local minima. Specifically, they were designed to reach a global minimum despite the presence of many local minima. To do so, these continuation methods would construct easier cost functions by "blurring" the original cost function. This blurring operation can be done by approximating

    J^{(i)}(θ) = E_{θ′ ∼ N(θ′; θ, σ^{(i)2})} J(θ′)        (8.46)

via sampling. The intuition for this approach is that some non-convex functions become approximately convex when blurred. In many cases, this blurring preserves enough information about the location of a global minimum that we can find the global minimum by solving progressively less blurred versions of the problem. This approach can break down in three different ways. First, it might successfully define a series of cost functions where the first is convex and the optimum tracks from one function to the next arriving at the global minimum, but it might require so many incremental cost functions that the cost of the entire procedure remains high. NP-hard optimization problems remain NP-hard, even when continuation methods
are applicable. The other two ways that continuation methods fail both correspond to the method not being applicable. First, the function might not become convex, no matter how much it is blurred. Consider for example the function J(θ) = −θ. Second, the function may become convex as a result of blurring, but the minimum of this blurred function may track to a local rather than a global minimum of the original cost function.

Though continuation methods were mostly originally designed to deal with the problem of local minima, local minima are no longer believed to be the primary problem for neural network optimization. Fortunately, continuation methods can still help. The easier objective functions introduced by the continuation method can eliminate flat regions, decrease variance in gradient estimates, improve conditioning of the Hessian matrix, or do anything else that will either make local updates easier to compute or improve the correspondence between local update directions and progress toward a global solution.

Bengio et al. (2009) observed that an approach called curriculum learning or shaping can be interpreted as a continuation method. Curriculum learning is based on the idea of planning a learning process to begin by learning simple concepts and progress to learning more complex concepts that depend on these simpler concepts. This basic strategy was previously known to accelerate progress in animal training (Skinner, 1958; Peterson, 2004; Krueger and Dayan, 2009) and machine learning (Solomonoff, 1989; Elman, 1993; Sanger, 1994). Bengio et al. (2009) justified this strategy as a continuation method, where earlier J^{(i)} are made easier by increasing the influence of simpler examples (either by assigning their contributions to the cost function larger coefficients, or by sampling them more frequently), and experimentally demonstrated that better results could be obtained by following a curriculum on a large-scale neural language modeling task. Curriculum learning has been successful on a wide range of natural language (Spitkovsky et al., 2010; Collobert et al., 2011a; Mikolov et al., 2011b; Tu and Honavar, 2011) and computer vision (Kumar et al., 2010; Lee and Grauman, 2011; Supancic and Ramanan, 2013) tasks. Curriculum learning was also verified as being consistent with the way in which humans teach (Khan et al., 2011): teachers start by showing easier and more prototypical examples and then help the learner refine the decision surface with the less obvious cases. Curriculum-based strategies are more effective for teaching humans than strategies based on uniform sampling of examples, and can also increase the effectiveness of other teaching strategies (Basu and Christensen, 2013).

Another important contribution to research on curriculum learning arose in the context of training recurrent neural networks to capture long-term dependencies:
Zaremba and Sutskever (2014) found that much better results were obtained with a stochastic curriculum, in which a random mix of easy and difficult examples is always presented to the learner, but where the average proportion of the more difficult examples (here, those with longer-term dependencies) is gradually increased. With a deterministic curriculum, no improvement over the baseline (ordinary training from the full training set) was observed.

We have now described the basic family of neural network models and how to regularize and optimize them. In the chapters ahead, we turn to specializations of the neural network family, that allow neural networks to scale to very large sizes and process input data that has special structure. The optimization methods discussed in this chapter are often directly applicable to these specialized architectures with little or no modification.
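As a closing sketch, a stochastic curriculum in the spirit described above might look as follows. The linear difficulty schedule, function names, and data layout are illustrative assumptions, not the authors' procedure.

```python
import random

def stochastic_curriculum(easy, hard, n_steps, seed=0):
    # Yield one training example per step. Every step is a random mix:
    # easy examples remain possible throughout, but the probability of
    # drawing a difficult example grows over the course of training.
    rng = random.Random(seed)
    for step in range(n_steps):
        p_hard = step / max(n_steps - 1, 1)  # rises linearly from 0 to 1
        pool = hard if rng.random() < p_hard else easy
        yield rng.choice(pool)
```

A deterministic curriculum would instead present only easy examples at first and only hard ones later; the observation above is that the random mixing is what helps.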
Chapter 9
Convolutional Networks

Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels. Convolutional networks have been tremendously successful in practical applications. The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.

In this chapter, we will first describe what convolution is. Next, we will explain the motivation behind using convolution in a neural network. We will then describe an operation called pooling, which almost all convolutional networks employ. Usually, the operation used in a convolutional neural network does not correspond precisely to the definition of convolution as used in other fields such as engineering or pure mathematics. We will describe several variants on the convolution function that are widely used in practice for neural networks. We will also show how convolution may be applied to many kinds of data, with different numbers of dimensions. We then discuss means of making convolution more efficient. Convolutional networks stand out as an example of neuroscientific principles influencing deep learning. We will discuss these neuroscientific principles, then conclude with comments about the role convolutional networks have played in the history of deep learning. One topic this chapter does not address is how to choose the architecture of your convolutional network. The goal of this chapter is to describe the kinds of tools that convolutional networks provide, while Chapter 11
CHAPTER 9. CONVOLUTIONAL NETWORKS
describes general guidelines for choosing which tools to use in which circumstances. Research into convolutional network architectures proceeds so rapidly that a new best architecture for a given benchmark is announced every few weeks to months, rendering it impractical to describe the best architecture in print. However, the best architectures have consistently been composed of the building blocks described here.
9.1
The Convolution Operation
In con is an op conv volution operation eration on tw two o functions of a real9.1its most Thegeneral Conform, volution Operation valued argument. To motiv motivate ate the definition of con conv volution, we start with examples In its most general form, con v olution is an op eration on two functions of a realof two functions we migh mightt use. valued argument. To motivate the definition of convolution, we start with examples Supp Suppose ose we are tracking the lo location cation of a spaceship with a laser sensor. Our of two functions we might use. laser sensor provides a single output x(t), the p osition of the spaceship at time Suppxose wet are the i.e., lo cation of get a spaceship with a laser sensor. Our t. Both and are tracking real-v real-valued, alued, we can a different reading from the laser laser sensor a single sensor at an any yprovides instan instantt in time. output x(t), the p osition of the spaceship at time t. Both x and t are real-valued, i.e., we can get a different reading from the laser No Now w that laser sensor is somewhat noisy noisy.. To obtain a less noisy sensor atsuppose any instan t inour time. estimate of the spaceship’s p osition, we would like to av average erage together sev several eral No w suppose that our laser sensor is somewhat noisy . To relev obtain lesswenoisy measuremen measurements. ts. Of course, more recent measuremen measurements ts are more relevan an ant, t,a so will estimate of the spaceship’s p osition, we w ould like to av erage together sev eral wan antt this to b e a weigh eighted ted av average erage that gives more weigh weightt to recent measuremen measurements. ts. measuremen ts. Of course, more recent measuremen ts are more relev an t, so w e will We can do this with a weigh weighting ting function w(a), where a is the age of a measuremen measurement. t. wan tosuc b e ha awweigh eighted erage gives more weighmoment, t to recent ts. 
If wet this apply such weighted tedavav average eragethat op operation eration at every wemeasuremen obtain a new w(of a),the a is the We can dos this with aaweigh ting function where a measurement. function pro providing viding smo smoothed othed estimate position of age the of spaceship: If we apply such a weighted averageZ op eration at every moment, we obtain a new function s providing a smo othed of the spaceship: (9.1) s(t) =estimate x(a)wof(tthe − aposition )da s(t) = x(a)w (t a)da (9.1) This op operation eration is called convolution onvolution.. The conv convolution olution op operation eration is typically − denoted with an asterisk: This op eration is called convolution s(t) = (x. ∗The w )(tconv ) olution op eration is typically (9.2) denoted with an asterisk: Z x wprobability )(t) In our example, w needs to sb(te) a= v(alid density function, or(9.2) the output is not a weigh weighted ted av average. erage. Also, w needs to b e 0 for all negative argumen arguments, ts, ∗ w In our example, needs to b e a v alid probability density function, or the or it will lo look ok into the future, which is presumably b ey eyond ond our capabilities. These w output is not a weigh ted av erage. Also, needs to b e 0 for allcon negative ts, limitations are particular to our example though. In general, conv volutionargumen is defined or itan will lo ok intoforthe future, presumably b eyond capabilities. for any y functions which thewhich ab abov ov ovee isintegral is defined, andour may b e used forThese other limitations are particular to ourted example though. In general, convolution is defined purp purposes oses besides taking weigh weighted av averages. erages. for any functions for which the ab ove integral is defined, and may b e used for other In con conv volutional net netw work terminology terminology,, the first argumen argumentt (in this example, the purp oses besides taking weighted averages. 
In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input and the second
CHAPTER 9. CONVOLUTIONAL NETWORKS
argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.

In our example, the idea of a laser sensor that can provide measurements at every instant in time is not realistic. Usually, when we work with data on a computer, time will be discretized, and our sensor will provide data at regular intervals. In our example, it might be more realistic to assume that our laser provides a measurement once per second. The time index t can then take on only integer values. If we now assume that x and w are defined only on integer t, we can define the discrete convolution:

s(t) = (x ∗ w)(t) = Σ_{a=−∞}^{∞} x(a) w(t − a)    (9.3)
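The discrete convolution above can be sketched directly in code. The following is an illustrative NumPy sketch (the function name and the small arrays are our own, chosen to echo the noisy-sensor example), restricted to the positions where the kernel fully overlaps the input:

```python
import numpy as np

# Hypothetical noisy position readings x and weighting function w;
# w[0] is the weight given to the most recent measurement.
x = np.array([0.0, 1.1, 1.9, 3.2, 3.9, 5.1])
w = np.array([0.5, 0.3, 0.2])

def discrete_conv(x, w):
    # s(t) = sum_a x(a) w(t - a), evaluated only where w fully overlaps x.
    s = np.zeros(len(x) - len(w) + 1)
    for t in range(len(s)):
        for a in range(len(w)):
            # Reversing w implements the "age" indexing of the formula above.
            s[t] += x[t + a] * w[len(w) - 1 - a]
    return s

print(discrete_conv(x, w))
print(np.convolve(x, w, mode="valid"))  # agrees with NumPy's built-in convolution
```

The "valid" mode keeps only the outputs where the summation stays inside the stored, finite portion of x.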
In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. We will refer to these multidimensional arrays as tensors. Because each element of the input and kernel must be explicitly stored separately, we usually assume that these functions are zero everywhere but the finite set of points for which we store the values. This means that in practice we can implement the infinite summation as a summation over a finite number of array elements.

Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(m, n) K(i − m, j − n).    (9.4)
Convolution is commutative, meaning we can equivalently write:

S(i, j) = (K ∗ I)(i, j) = Σ_m Σ_n I(i − m, j − n) K(m, n).    (9.5)

Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n.

The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property
is useful for writing proofs, it is not usually an important property of a neural network implementation. Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:

S(i, j) = (I ∗ K)(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n).    (9.6)
Many machine learning libraries implement cross-correlation but call it convolution. In this text we will follow this convention of calling both operations convolution, and specify whether we mean to flip the kernel or not in contexts where kernel flipping is relevant. In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping. It is also rare for convolution to be used alone in machine learning; instead convolution is used simultaneously with other functions, and the combination of these functions does not commute regardless of whether the convolution operation flips its kernel or not.

See Fig. 9.1 for an example of convolution (without kernel flipping) applied to a 2-D tensor.
Discrete convolution can be viewed as multiplication by a matrix. However, the matrix has several entries constrained to be equal to other entries. For example, for univariate discrete convolution, each row of the matrix is constrained to be equal to the row above shifted by one element. This is known as a Toeplitz matrix. In two dimensions, a doubly block circulant matrix corresponds to convolution. In addition to these constraints that several elements be equal to each other, convolution usually corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero). This is because the kernel is usually much smaller than the input image. Any neural network algorithm that works with matrix multiplication and does not depend on specific properties of the matrix structure should work with convolution, without requiring any further changes to the neural network.
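This matrix view can be sketched for the univariate case. In the illustrative NumPy snippet below (the signal and kernel values are made up), each row of the Toeplitz matrix T carries the flipped kernel, shifted one column per row, and multiplying by T reproduces the convolution:

```python
import numpy as np

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
w = np.array([0.5, 0.25])

# Build the Toeplitz matrix for "valid" 1-D convolution: each row is the
# row above shifted right by one element, and most entries are zero.
n_out = len(x) - len(w) + 1
T = np.zeros((n_out, len(x)))
for i in range(n_out):
    T[i, i:i + len(w)] = w[::-1]

print(T @ x)                            # matrix multiplication...
print(np.convolve(x, w, mode="valid"))  # ...matches direct convolution
```

Note that T illustrates both constraints from the text: its entries are mostly zero, and the nonzero entries along each diagonal are tied to be equal.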
Typical convolutional neural networks do make use of further specializations in order to deal with large inputs efficiently, but these are not strictly necessary from a theoretical perspective.
[Figure 9.1 diagram: a 3×4 input grid with entries a–l, a 2×2 kernel with entries w, x, y, z, and the resulting 2×3 output with entries aw+bx+ey+fz, bw+cx+fy+gz, cw+dx+gy+hz, ew+fx+iy+jz, fw+gx+jy+kz, gw+hx+ky+lz.]
Figure 9.1: An example of 2-D convolution without kernel-flipping. In this case we restrict the output to only positions where the kernel lies entirely within the image, called "valid" convolution in some contexts. We draw boxes with arrows to indicate how the upper-left element of the output tensor is formed by applying the kernel to the corresponding upper-left region of the input tensor.
9.2 Motivation
Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size. We now describe each of these ideas in turn.

Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means every output unit interacts with every input unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights). This is accomplished by making the kernel smaller than the input.
For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of magnitude smaller than m.
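These counts are easy to make concrete. The numbers below are hypothetical, chosen only to illustrate the scaling described above:

```python
# Hypothetical layer sizes: m inputs, n outputs, k connections per output.
m = 1_000_000   # e.g. inputs from a one-megapixel image
n = 1_000_000
k = 9           # e.g. a 3x3 kernel

dense_params = m * n     # fully connected: O(m x n) parameters
sparse_params = k * n    # sparse connectivity: O(k x n) parameters

# The dense layer needs over 100,000 times more parameters here.
print(dense_params // sparse_params)
```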
For graphical demonstrations of sparse connectivity, see Fig. 9.2 and Fig. 9.3. In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input, as shown in Fig. 9.4. This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.

Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer. It is multiplied by one element of the input and then never revisited. As a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere. In a convolutional
In a conv convolutional olutional one cannet, sayeach that member a network d weights , b ecause the value ofofthe applied neural of has the tie kernel is used at every p osition theweigh inputt (except toerhaps one input the valuepixels, of a wdep eigh t applied elsewhere. In a convregarding olutional p someisoftied the to b oundary depending ending on the design decisions neural net, each member of the kernel is used at every p osition of the input the b oundary). The parameter sharing used by the con conv volution operation(except means p erhaps some of the b oundary pixels,set dep on thefor design that rather than learning a separate ofending parameters ev every erydecisions lo location, cation,regarding we learn the b oundary). The parameter sharing used by the convolution operation means that rather than learning a separate set336 of parameters for every lo cation, we learn
Figure 9.2: Sparse connectivity, viewed from below: We highlight one input unit, x3, and also highlight the output units in s that are affected by this unit. (Top) When s is formed by convolution with a kernel of width 3, only three outputs are affected by x3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the outputs are affected by x3.
Figure 9.3: Sparse connectivity, viewed from above: We highlight one output unit, s3, and also highlight the input units in x that affect this unit. These units are known as the receptive field of s3. (Top) When s is formed by convolution with a kernel of width 3, only three inputs affect s3. (Bottom) When s is formed by matrix multiplication, connectivity is no longer sparse, so all of the inputs affect s3.
Figure 9.4: The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers. This effect increases if the network includes architectural features like strided convolution (Fig. 9.12) or pooling (Sec. 9.3). This means that even though direct connections in a convolutional net are very sparse, units in the deeper layers can be indirectly connected to all or most of the input image.
Figure 9.5: Parameter sharing: Black arrows indicate the connections that use a particular parameter in two different models. (Top) The black arrows indicate uses of the central element of a 3-element kernel in a convolutional model. Due to parameter sharing, this single parameter is used at all input locations. (Bottom) The single black arrow indicates the use of the central element of the weight matrix in a fully connected model. This model has no parameter sharing so the parameter is used only once.
only one set. This does not affect the runtime of forward propagation—it is still O(k × n)—but it does further reduce the storage requirements of the model to k parameters. Recall that k is usually several orders of magnitude less than m. Since m and n are usually roughly the same size, k is practically insignificant compared to m × n. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency. For a graphical depiction of how parameter sharing works, see Fig. 9.5.

As an example of both of these first two principles in action, Fig. 9.6 shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image.

In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation.
To say a function is equivariant means that if the input changes, the output changes in the same way. Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, if we let g be any function that translates the input, i.e., shifts it, then the convolution function is equivariant to g. For example, let I be a function giving image brightness at integer coordinates. Let g be a function mapping one image function to another image function, such that I′ = g(I) is
the image function with I′(x, y) = I(x − 1, y). This shifts every pixel of I one unit to the right. If we apply this transformation to I, then apply convolution, the result will be the same as if we applied convolution to I, then applied the transformation g to the output. When processing time series data, this means that convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later in time. Similarly with images, convolution creates a 2-D map of where certain features appear in the input. If we move the object in the input, its representation will move the same amount in the output. This is useful for when we know that some function of a small number of neighboring pixels is useful when applied to multiple input locations.
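Translation equivariance can be checked numerically. The following is an illustrative NumPy sketch (the signal and kernel values are arbitrary): shifting a 1-D input by prepending a zero and then convolving gives the same result as convolving first and then shifting the output:

```python
import numpy as np

x = np.array([0.0, 1.0, 3.0, 2.0, 0.0])   # a hypothetical 1-D signal
w = np.array([1.0, -1.0])                  # a simple difference kernel

def shift(v):
    # g: translate the signal one step later in time.
    return np.concatenate(([0.0], v))

# f(g(x)) == g(f(x)): convolution commutes with translation.
print(np.allclose(np.convolve(shift(x), w), shift(np.convolve(x, w))))
```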
For example, when processing images, it is useful to detect edges in the first layer of a convolutional network. The same edges appear more or less everywhere in the image, so it is practical to share parameters across the entire image. In some cases, we may not wish to share parameters across the entire image. For example, if we are processing images that are cropped to be centered on an individual's face, we probably want to extract different features at different locations—the part of the network processing the top of the face needs to look for eyebrows, while the part of the network processing the bottom of the face needs to look for a chin.

Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image.
Other mechanisms are necessary for handling these kinds of transformations.

Finally, some kinds of data cannot be processed by neural networks defined by matrix multiplication with a fixed-shape matrix. Convolution enables processing of some of these kinds of data. We discuss this further in Sec. 9.7.
9.3 Pooling
A typical layer of a convolutional network consists of three stages (see Fig. 9.7). In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations. In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage. In the third stage, we use a pooling function to modify the output of the layer further.

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling (Zhou and Chellappa, 1988) operation reports the maximum output within a rectangular
Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left. This shows the strength of all of the vertically oriented edges in the input image, which can be a useful operation for object detection. Both images are 280 pixels tall. The input image is 320 pixels wide while the output image is 319 pixels wide. This transformation can be described by a convolution kernel containing two elements, and requires 319 × 280 × 3 = 267,960 floating point operations (two multiplications and one addition per output pixel) to compute using convolution. To describe the same transformation with a matrix multiplication would take 320 × 280 × 319 × 280, or over eight billion, entries in the matrix, making convolution four billion times more efficient for representing this transformation. The straightforward matrix multiplication algorithm performs over sixteen billion floating point operations, making convolution roughly 60,000 times more efficient computationally. Of course, most of the entries of the matrix would be zero. If we stored only the nonzero entries of the matrix, then both matrix multiplication and convolution would require the same number of floating point operations to compute. The matrix would still need to contain 2 × 319 × 280 = 178,640 entries. Convolution is an extremely efficient way of describing transformations that apply the same linear transformation of a small, local region across the entire input. (Photo credit: Paula Goodfellow)
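To make the figure's efficiency claim concrete, here is a small sketch (ours, not from the book) that applies a two-element edge-detection kernel to a 1-D signal both as a sliding-window convolution and as a multiplication by the equivalent, mostly zero matrix; the two outputs agree.

```python
import numpy as np

# Edge-detection kernel in the spirit of the figure: each output is the
# difference between a pixel and its right-hand neighbor.
kernel = np.array([1.0, -1.0])

signal = np.array([3.0, 3.0, 5.0, 5.0, 2.0])   # a tiny 1-D "image row"
n_out = signal.size - kernel.size + 1          # valid convolution: 4 outputs

# 1) As a (cross-correlation style) convolution: slide the kernel across the input.
conv_out = np.array([np.dot(kernel, signal[i:i + kernel.size])
                     for i in range(n_out)])

# 2) As a matrix multiplication: a banded matrix with the kernel on each row.
W = np.zeros((n_out, signal.size))
for i in range(n_out):
    W[i, i:i + kernel.size] = kernel
matmul_out = W @ signal

assert np.allclose(conv_out, matmul_out)
print(conv_out)  # [ 0. -2.  0.  3.]
```

The matrix W stores n_out × signal.size entries to represent the same transformation that the kernel describes with just two numbers, which is the efficiency gap the caption quantifies.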
[Figure 9.7 diagram: two parallel pipelines. Complex layer terminology: input to layer → convolutional layer (convolution stage: affine transform; detector stage: nonlinearity, e.g., rectified linear; pooling stage) → next layer. Simple layer terminology: input to layer → convolution layer (affine transform) → detector layer (nonlinearity, e.g., rectified linear) → pooling layer → next layer.]
Figure 9.7: The components of a typical convolutional neural network layer. There are two commonly used sets of terminology for describing these layers. (Left) In this terminology, the convolutional net is viewed as a small number of relatively complex layers, with each layer having many "stages." In this terminology, there is a one-to-one mapping between kernel tensors and network layers. In this book we generally use this terminology. (Right) In this terminology, the convolutional net is viewed as a larger number of simple layers; every step of processing is regarded as a layer in its own right. This means that not every "layer" has parameters.
neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.

In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change. See Fig. 9.8 for an example of how this works. Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is. For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy, we just need to know that there is an eye on the left side of the face and an eye on the right side of the face.
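As an illustration (ours, not the book's), the summary statistics named above can each be computed over a small 1-D neighborhood of detector outputs:

```python
import numpy as np

detector_outputs = np.array([0.1, 1.0, 0.2])   # a width-3 pooling neighborhood

max_pool = detector_outputs.max()                 # max pooling
avg_pool = detector_outputs.mean()                # average of the neighborhood
l2_pool = np.sqrt(np.sum(detector_outputs ** 2))  # L2 norm of the neighborhood

# Weighted average based on distance from the central pixel.
# These particular weights are an arbitrary choice for illustration.
weights = np.array([0.25, 0.5, 0.25])
weighted_pool = np.dot(weights, detector_outputs)

print(max_pool, avg_pool, l2_pool, weighted_pool)
```

Each statistic summarizes the same neighborhood; which one is preferable depends on the task, as the theoretical work cited later in this section discusses.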
In other contexts, it is more important to preserve the location of a feature. For example, if we want to find a corner defined by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet.

The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.

Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to (see Fig. 9.9).
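A small sketch (ours, under made-up values) of the invariance described above: with stride-1 max pooling over width-3 regions, shifting the detector outputs by one pixel changes every detector value but only some of the pooled values.

```python
import numpy as np

def max_pool_1d(x, width=3):
    """Stride-1 max pooling over a window of `width` values."""
    return np.array([x[i:i + width].max() for i in range(len(x) - width + 1)])

detector = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.1])
shifted = np.roll(detector, 1)   # translate the detector outputs by one pixel

pooled = max_pool_1d(detector)
pooled_shifted = max_pool_1d(shifted)

# Every detector value moved, but half of the pooled values are unchanged.
print(pooled)          # [1.  1.  0.2 0.1]
print(pooled_shifted)  # [1.  1.  1.  0.2]
```

The pooled outputs change less than the detector outputs because each max pooling unit only cares about the maximum value in its neighborhood, not its exact location.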
Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See Fig. 9.10 for an example. This improves the computational efficiency of the network because the next layer has roughly k times fewer inputs to process. When the number of parameters in the next layer is a function of its input size (such as when the next layer is fully connected and based on matrix multiplication) this reduction in the input size can also result in improved statistical efficiency and reduced memory requirements for storing the parameters.

For many tasks, pooling is essential for handling inputs of varying size. For example, if we want to classify images of variable size, the input to the classification layer must have a fixed size.
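Both ideas can be sketched in a few lines (our illustration, with made-up sizes, not code from the book): pooling regions spaced by a stride shrink the output by roughly that factor, while dividing the input into a fixed number of regions yields a fixed-size output no matter how long the input is.

```python
import numpy as np

def strided_max_pool(x, width, stride):
    """Max pooling over regions spaced `stride` pixels apart."""
    return np.array([x[i:i + width].max()
                     for i in range(0, len(x) - width + 1, stride)])

def fixed_count_max_pool(x, n_pools):
    """Split the input into `n_pools` regions, whatever its length."""
    return np.array([chunk.max() for chunk in np.array_split(x, n_pools)])

x = np.random.rand(12)
print(strided_max_pool(x, width=3, stride=2).shape)  # (5,) roughly half as many
print(fixed_count_max_pool(x, n_pools=4).shape)      # (4,)

y = np.random.rand(31)                               # a different input size...
print(fixed_count_max_pool(y, n_pools=4).shape)      # (4,) same output size
```

The second function mirrors the idea of varying the pooling region size so that a classification layer always receives the same number of summary statistics.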
This is usually accomplished by varying the size of an offset between pooling regions so that the classification layer always receives the same number of summary statistics regardless of the input size. For example, the final pooling layer of the network may be defined to output four sets of summary statistics, one for each quadrant of an image, regardless of the image size.

Some theoretical work gives guidance as to which kinds of pooling one should
Figure 9.8: Max pooling introduces invariance. (Top) A view of the middle of the output of a convolutional layer. The bottom row shows outputs of the nonlinearity. The top row shows the outputs of max pooling, with a stride of one pixel between pooling regions and a pooling region width of three pixels. (Bottom) A view of the same network, after the input has been shifted to the right by one pixel. Every value in the bottom row has changed, but only half of the values in the top row have changed, because the max pooling units are only sensitive to the maximum value in the neighborhood, not its exact location.
Figure 9.9: Example of learned invariances: A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a hand-written 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated. We show here how the network processes two different inputs, resulting in two different detector units being activated. The effect on the pooling unit is roughly the same either way. This principle is leveraged by maxout networks (Goodfellow et al., 2013a) and other convolutional networks. Max pooling over spatial positions is naturally invariant to translation; this multi-channel approach is only necessary for learning other transformations.
Figure 9.10: Pooling with downsampling. Here we use max-pooling with a pool width of three and a stride between pools of two. This reduces the representation size by a factor of two, which reduces the computational and statistical burden on the next layer. Note that the rightmost pooling region has a smaller size, but must be included if we do not want to ignore some of the detector units.
use in various situations (Boureau et al., 2010). It is also possible to dynamically pool features together, for example, by running a clustering algorithm on the locations of interesting features (Boureau et al., 2011). This approach yields a different set of pooling regions for each image. Another approach is to learn a single pooling structure that is then applied to all images (Jia et al., 2012).

Pooling can complicate some kinds of neural network architectures that use top-down information, such as Boltzmann machines and autoencoders. These issues will be discussed further when we present these types of networks in Part III. Pooling in convolutional Boltzmann machines is presented in Sec. 20.6. The inverse-like operations on pooling units needed in some differentiable networks will be covered in Sec. 20.10.6.

Some examples of complete convolutional network architectures for classification using convolution and pooling are shown in Fig. 9.11.
9.4 Convolution and Pooling as an Infinitely Strong Prior

Recall the concept of a prior probability distribution from Sec. 5.2. This is a probability distribution over the parameters of a model that encodes our beliefs about what models are reasonable, before we have seen any data.

Priors can be considered weak or strong depending on how concentrated the probability density in the prior is. A weak prior is a prior distribution with high entropy, such as a Gaussian distribution with high variance. Such a prior allows the data to move the parameters more or less freely. A strong prior has very low entropy, such as a Gaussian distribution with low variance. Such a prior plays a more active role in determining where the parameters end up.

An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data gives to those values.
We can imagine a convolutional net as being similar to a fully connected net, but with an infinitely strong prior over its weights. This infinitely strong prior says that the weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space. The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit. Overall, we can think of the use of convolution as introducing an infinitely strong prior probability distribution over the parameters of a layer. This prior says that the function the layer should learn contains only local interactions and is
[Figure 9.11 diagram: three architectures, each starting from a 256x256x3 input image and stacking convolution+ReLU (256x256x64), pooling with stride 4 (64x64x64), convolution+ReLU (64x64x64), and pooling with stride 4 (16x16x64), then diverging. (Left) reshape to a vector of 16,384 units → matrix multiply to 1,000 units → softmax over 1,000 class probabilities. (Center) pooling to a 3x3 grid (3x3x64) → reshape to a vector of 576 units → matrix multiply to 1,000 units → softmax over 1,000 class probabilities. (Right) convolution to 16x16x1,000 → average pooling to 1x1x1,000 → softmax over 1,000 class probabilities.]
Figure 9.11: Examples of architectures for classification with convolutional networks. The specific strides and depths used in this figure are not advisable for real use; they are designed to be very shallow in order to fit onto the page. Real convolutional networks also often involve significant amounts of branching, unlike the chain structures used here for simplicity. (Left) A convolutional network that processes a fixed image size. After alternating between convolution and pooling for a few layers, the tensor for the convolutional feature map is reshaped to flatten out the spatial dimensions. The rest of the network is an ordinary feedforward network classifier, as described in Chapter 6. (Center) A convolutional network that processes a variable-sized image, but still maintains a fully connected section. This network uses a pooling operation with variably-sized pools but a fixed number of pools, in order to provide a fixed-size vector of 576 units to the fully connected portion of the network. (Right) A convolutional network that does not have any fully connected weight layer. Instead, the last convolutional layer outputs one feature map per class. The model presumably learns a map of how likely each class is to occur at each spatial location. Averaging a feature map down to a single value provides the argument to the softmax classifier at the top.
equivariant to translation. Likewise, the use of pooling is an infinitely strong prior that each unit should be invariant to small translations.

Of course, implementing a convolutional net as a fully connected net with an infinitely strong prior would be extremely computationally wasteful. But thinking of a convolutional net as a fully connected net with an infinitely strong prior can give us some insights into how convolutional nets work.

One key insight is that convolution and pooling can cause underfitting. Like any prior, convolution and pooling are only useful when the assumptions made by the prior are reasonably accurate. If a task relies on preserving precise spatial information, then using pooling on all features can increase the training error. Some convolutional network architectures (Szegedy et al., 2014a) are designed to use pooling on some channels but not on other channels, in order to get both highly invariant features and features that will not underfit when the translation invariance prior is incorrect. When a task involves incorporating information from very distant locations in the input, then the prior imposed by convolution may be inappropriate.

Another key insight from this view is that we should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance. Models that do not use convolution would be able to learn even if we permuted all of the pixels in the image. For many image datasets, there are separate benchmarks for models that are permutation invariant and must discover the concept of topology via learning, and models that have the knowledge of spatial relationships hard-coded into them by their designer.
9.5 Variants of the Basic Convolution Function

When discussing convolution in the context of neural networks, we usually do not refer exactly to the standard discrete convolution operation as it is usually understood in the mathematical literature. The functions used in practice differ slightly. Here we describe these differences in detail, and highlight some useful properties of the functions used in neural networks.

First, when we refer to convolution in the context of neural networks, we usually actually mean an operation that consists of many applications of convolution in parallel. This is because convolution with a single kernel can only extract one kind of feature, albeit at many spatial locations. Usually we want each layer of our network to extract many kinds of features, at many locations.

Additionally, the input is usually not just a grid of real values. Rather, it is a
grid of vector-valued observations. For example, a color image has a red, green and blue intensity at each pixel. In a multilayer convolutional network, the input to the second layer is the output of the first layer, which usually has the output of many different convolutions at each position. When working with images, we usually think of the input and output of the convolution as being 3-D tensors, with one index into the different channels and two indices into the spatial coordinates of each channel. Software implementations usually work in batch mode, so they will actually use 4-D tensors, with the fourth axis indexing different examples in the batch, but we will omit the batch axis in our description here for simplicity.
linear operations they are based on are not guaranteed to be commutative, even if kernel-flipping is used. These multi-channel operations are only commutative if each operation has the same number of output channels as input channels.

Assume we have a 4-D kernel tensor K with element K_{i,j,k,l} giving the connection strength between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit. Assume our input consists of observed data V with element V_{i,j,k} giving the value of the input unit within channel i at row j and column k. Assume our output consists of Z with the same format as V. If Z is produced by convolving K across V without flipping K, then

    Z_{i,j,k} = \sum_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n}        (9.7)

where the summation over l, m and n is over all values for which the tensor indexing operations inside the summation are valid.
In linear algebra notation, we index into arrays using a 1 for the first entry. This necessitates the −1 in the above formula. Programming languages such as C and Python index starting from 0, rendering the above expression even simpler.

We may want to skip over some positions of the kernel in order to reduce the computational cost (at the expense of not extracting our features as finely). We can think of this as downsampling the output of the full convolution function. If we want to sample only every s pixels in each direction in the output, then we can define a downsampled convolution function c such that

    Z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} V_{l, (j-1) \times s + m, (k-1) \times s + n} K_{i,l,m,n}.        (9.8)

We refer to s as the stride of this downsampled convolution. It is also possible to define a separate stride for each direction of motion. See Fig. 9.12 for an illustration.
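The index arithmetic of Eqs. 9.7 and 9.8 can be sketched as explicit loops. This is a minimal 0-based NumPy illustration of our own (the function name and the restriction to "valid" kernel positions are our assumptions, not the book's definition); setting s = 1 recovers the plain multi-channel convolution of Eq. 9.7:

```python
import numpy as np

def conv2d_strided(K, V, s):
    """Downsampled (strided) multi-channel convolution, as in Eq. 9.8,
    written with 0-based indices so the -1 offsets disappear.

    K: 4-D kernel, K[i, l, m, n] connects output channel i to input
       channel l with row offset m and column offset n (no flipping).
    V: 3-D input, V[l, r, c] is input channel l at row r, column c.
    s: stride; s = 1 gives the plain convolution of Eq. 9.7.
    """
    out_ch, in_ch, kh, kw = K.shape
    in_ch_v, H, W = V.shape
    assert in_ch == in_ch_v
    out_h = (H - kh) // s + 1   # only positions where the kernel fits
    out_w = (W - kw) // s + 1
    Z = np.zeros((out_ch, out_h, out_w))
    for i in range(out_ch):
        for j in range(out_h):
            for k in range(out_w):
                # sum over input channels and kernel offsets l, m, n
                patch = V[:, j * s : j * s + kh, k * s : k * s + kw]
                Z[i, j, k] = np.sum(patch * K[i])
    return Z
```

Real frameworks use vectorized or FFT-based implementations; the loops here only mirror the index arithmetic of the equations.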
[Figure 9.12 diagram: (Top) "Strided convolution" panel mapping inputs x1–x5 directly to outputs s1–s3; (Bottom) "Convolution" panel producing z1–z5, followed by "Downsampling" to s1–s3.]
Figure 9.12: Convolution with a stride. In this example, we use a stride of two. (Top) Convolution with a stride length of two implemented in a single operation. (Bottom) Convolution with a stride greater than one pixel is mathematically equivalent to convolution with unit stride followed by downsampling. Obviously, the two-step approach involving downsampling is computationally wasteful, because it computes many values that are then discarded.
One essential feature of any convolutional network implementation is the ability to implicitly zero-pad the input V in order to make it wider. Without this feature, the width of the representation shrinks by one pixel less than the kernel width at each layer. Zero padding the input allows us to control the kernel width and the size of the output independently. Without zero padding, we are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels—both scenarios that significantly limit the expressive power of the network. See Fig. 9.13 for an example.

Three special cases of the zero-padding setting are worth mentioning. One is the extreme case in which no zero-padding is used whatsoever, and the convolution kernel is only allowed to visit positions where the entire kernel is contained entirely within the image. In MATLAB terminology, this is called valid convolution.
In this case, all pixels in the output are a function of the same number of pixels in the input, so the behavior of an output pixel is somewhat more regular. However, the size of the output shrinks at each layer. If the input image has width m and the kernel has width k, the output will be of width m − k + 1. The rate of this shrinkage can be dramatic if the kernels used are large. Since the shrinkage is greater than 0, it limits the number of convolutional layers that can be included in the network. As layers are added, the spatial dimension of the network will eventually drop to 1 × 1, at which point additional layers cannot meaningfully be considered convolutional. Another special case of the zero-padding setting is when just enough zero-padding is added to keep the size of the output equal to the size of the input. MATLAB calls this same convolution.
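The output-width arithmetic of these padding regimes can be checked for 1-D signals with NumPy's np.convolve, which exposes the same three MATLAB-style modes discussed in this section (a quick illustrative check of our own, using the sixteen-pixel input and width-six kernel of Fig. 9.13):

```python
import numpy as np

# Output widths of the three zero-padding regimes for a 1-D signal
# of width m = 16 and a kernel of width k = 6.
m, k = 16, 6
signal = np.ones(m)
kernel = np.ones(k)

valid = np.convolve(signal, kernel, mode='valid')  # width m - k + 1 = 11
same  = np.convolve(signal, kernel, mode='same')   # width m         = 16
full  = np.convolve(signal, kernel, mode='full')   # width m + k - 1 = 21

print(len(valid), len(same), len(full))  # 11 16 21
```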
In this case, the network can contain as many convolutional layers as the available hardware can support, since the operation of convolution does not modify the architectural possibilities available to the next layer. However, the input pixels near the border influence fewer output pixels than the input pixels near the center. This can make the border pixels somewhat underrepresented in the model. This motivates the other extreme case, which MATLAB refers to as full convolution, in which enough zeroes are added for every pixel to be visited k times in each direction, resulting in an output image of width m + k − 1. In this case, the output pixels near the border are a function of fewer pixels than the output pixels near the center. This can
make it difficult to learn a single kernel that performs well at all positions in the convolutional feature map. Usually the optimal amount of zero padding (in terms of test set classification accuracy) lies somewhere between "valid" and "same" convolution.

In some cases, we do not actually want to use convolution, but rather locally connected layers (LeCun, 1986, 1989). In this case, the adjacency matrix in the graph of our MLP is the same, but every connection has its own weight, specified
Figure 9.13: The effect of zero padding on network size: Consider a convolutional network with a kernel of width six at every layer. In this example, we do not use any pooling, so only the convolution operation itself shrinks the network size. (Top) In this convolutional network, we do not use any implicit zero padding. This causes the representation to shrink by five pixels at each layer. Starting from an input of sixteen pixels, we are only able to have three convolutional layers, and the last layer does not ever move the kernel, so arguably only two of the layers are truly convolutional. The rate of shrinking can be mitigated by using smaller kernels, but smaller kernels are less expressive and some shrinking is inevitable in this kind of architecture. (Bottom) By adding five implicit zeroes to each layer, we prevent the representation from shrinking with depth. This allows us to make an arbitrarily deep convolutional network.
by a 6-D tensor W. The indices into W are respectively: i, the output channel, j, the output row, k, the output column, l, the input channel, m, the row offset within the input, and n, the column offset within the input. The linear part of a locally connected layer is then given by

    Z_{i,j,k} = \sum_{l,m,n} [V_{l, j+m-1, k+n-1} w_{i,j,k,l,m,n}].        (9.9)

This is sometimes also called unshared convolution, because it is a similar operation to discrete convolution with a small kernel, but without sharing parameters across locations. Fig. 9.14 compares local connections, convolution, and full connections.

Locally connected layers are useful when we know that each feature should be a function of a small part of space, but there is no reason to think that the same feature should occur across all of space. For example, if we want to tell if an image
is a picture of a face, we only need to look for the mouth in the bottom half of the image.

It can also be useful to make versions of convolution or locally connected layers in which the connectivity is further restricted, for example to constrain that each output channel i be a function of only a subset of the input channels l. A common way to do this is to make the first m output channels connect to only the first n input channels, the second m output channels connect to only the second n input channels, and so on. See Fig. 9.15 for an example. Modeling interactions between few channels allows the network to have fewer parameters in order to reduce memory consumption and increase statistical efficiency, and also reduces the amount of computation needed to perform forward and back-propagation. It accomplishes these goals without reducing the number of hidden units.

Tiled convolution (Gregor and LeCun, 2010a; Le et al., 2010) offers a compromise between a convolutional layer and a locally connected layer.
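The unshared convolution of Eq. 9.9 can be sketched with the same kind of loops as ordinary convolution, now with one kernel per output position (a 0-based NumPy sketch of our own; the function name is hypothetical):

```python
import numpy as np

def unshared_conv2d(w, V):
    """Locally connected layer, as in Eq. 9.9, with 0-based indices.

    w: 6-D weight tensor, w[i, j, k, l, m, n] -- a separate kernel
       for every output position (j, k), so no parameter sharing.
    V: 3-D input, V[l, r, c].
    """
    out_ch, out_h, out_w, in_ch, kh, kw = w.shape
    Z = np.zeros((out_ch, out_h, out_w))
    for i in range(out_ch):
        for j in range(out_h):
            for k in range(out_w):
                # each (j, k) uses its own weights w[i, j, k]
                patch = V[:, j : j + kh, k : k + kw]
                Z[i, j, k] = np.sum(patch * w[i, j, k])
    return Z
```

Note that w has out_h × out_w times as many parameters as the shared 4-D kernel of Eq. 9.7 for the same connectivity pattern.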
Rather than learning Tile d c onvolution ( Gregor and LeCun , 2010a ; Le et al. , 2010 compromise a separate set of weigh eights ts at every spatial lo location, cation, we learn )a offers set ofakernels that b et een a through convolutional er and a lo cally connected lay er. Rather than learning w ew rotate as wlay e mo e through space. This means that immediately mov v a separate of weigh ts ha atvevery spatial lo cation, learn set of kernels neigh neighb b oringset lo locations cations will hav e different filters, lik likee inwaelo locally cally aconnected lay layer, er,that but w e rotate through as w e mo v e through space. This means that immediately the memory requirements for storing the parameters will increase only by a factor neigh cations have different likesize in aoflothe callyen connected layfeature er, but of theb oring size oflothis set will of kernels, rather filters, than the entire tire output the memory requirements for storing of thelo parameters will increase onlycon byvaolution, factor map. See Fig. 9.16 for a comparison locally cally connected lay layers, ers, tiled conv of the size of this of kernels, rather than the size of the entire output feature and standard con conv vset olution. map. See Fig. 9.16 for a comparison of lo cally connected layers, tiled convolution, o define tiled conv convolution olution algebraically algebraically,, let k b e a 6-D tensor, where two of andTstandard convolution. the dimensions corresp correspond ond to differen differentt lo locations cations in the output map. Rather than k b e amap, T o define tiled conv olution algebraically , let 6-D output tensor,lo where o of ha having ving a separate index for eac each h lo location cation in the output locations cationstwcycle the dimensions ond differen lo cations map.If Rather than t differen t is equal through a set ofcorresp different t cto hoices of kternel stack in in the eachoutput direction. 
[Figure 9.14 diagram: three panels over inputs x1–x5 and outputs s1–s5; edges labeled a–i in the top (locally connected) panel, alternating a and b in the center (convolutional) panel, and unlabeled dense edges in the bottom (fully connected) panel.]
Figure 9.14: Comparison of local connections, convolution, and full connections. (Top) A locally connected layer with a patch size of two pixels. Each edge is labeled with a unique letter to show that each edge is associated with its own weight parameter. (Center) A convolutional layer with a kernel width of two pixels. This model has exactly the same connectivity as the locally connected layer. The difference lies not in which units interact with each other, but in how the parameters are shared. The locally connected layer has no parameter sharing. The convolutional layer uses the same two weights repeatedly across the entire input, as indicated by the repetition of the letters labeling each edge. (Bottom) A fully connected layer resembles a locally connected layer in the sense that each edge has its own parameter (there are too many to label explicitly with letters in this diagram). However, it does not have the restricted connectivity of the locally connected layer.
[Figure 9.15 diagram: an input tensor mapped to an output tensor, with channel coordinates and spatial coordinates indicated.]
Figure 9.15: A convolutional network with the first two output channels connected to only the first two input channels, and the second two output channels connected to only the second two input channels.
[Figure 9.16 diagram: three panels over inputs x1–x5 and outputs s1–s5; edges labeled a–i in the top (locally connected) panel, cycling through a, b, c, d in the center (tiled convolution) panel, and alternating a and b in the bottom (standard convolution) panel.]
Figure 9.16: A comparison of locally connected layers, tiled convolution, and standard convolution. All three have the same sets of connections between units, when the same size of kernel is used. This diagram illustrates the use of a kernel that is two pixels wide. The differences between the methods lie in how they share parameters. (Top) A locally connected layer has no sharing at all. We indicate that each connection has its own weight by labeling each connection with a unique letter. (Center) Tiled convolution has a set of t different kernels. Here we illustrate the case of t = 2. One of these kernels has edges labeled "a" and "b," while the other has edges labeled "c" and "d." Each time we move one pixel to the right in the output, we move on to using a different kernel. This means that, like the locally connected layer, neighboring units in the output have different parameters. Unlike the locally connected layer, after we have gone through all t available kernels, we cycle back to the first kernel. If two output units are separated by a multiple of t steps, then they share parameters. (Bottom) Traditional convolution is equivalent to tiled convolution with t = 1. There is only one kernel and it is applied everywhere, as indicated in the diagram by using the kernel with weights labeled "a" and "b" everywhere.
the output width, this is the same as a locally connected layer.

    Z_{i,j,k} = \sum_{l,m,n} V_{l, j+m-1, k+n-1} K_{i,l,m,n, j%t+1, k%t+1},        (9.10)

where % is the modulo operation, with t % t = 0, (t + 1) % t = 1, etc. It is straightforward to generalize this equation to use a different tiling range for each dimension.

Both locally connected layers and tiled convolutional layers have an interesting interaction with max-pooling: the detector units of these layers are driven by different filters. If these filters learn to detect different transformed versions of the same underlying features, then the max-pooled units become invariant to the learned transformation (see Fig. 9.9). Convolutional layers are hard-coded to be invariant specifically to translation.

Other operations besides convolution are usually necessary to implement a convolutional network.
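The tiled convolution of Eq. 9.10 can be sketched as 0-based loops (our own illustrative code, assuming NumPy and a square tiling range; t = 1 reduces to standard convolution, and t equal to the output width gives a locally connected layer):

```python
import numpy as np

def tiled_conv2d(K, V):
    """Tiled convolution, a 0-based sketch of Eq. 9.10.

    K: 6-D tensor K[i, l, m, n, p, q]; the last two axes select which
       of the t kernel stacks is used, cycling with period t in each
       spatial direction of the output.
    V: 3-D input, V[l, r, c].
    """
    out_ch, in_ch, kh, kw, t, t2 = K.shape
    assert t == t2, "square tiling range assumed in this sketch"
    _, H, W = V.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    Z = np.zeros((out_ch, out_h, out_w))
    for i in range(out_ch):
        for j in range(out_h):
            for k in range(out_w):
                # cycle through the t x t kernels as we move in space
                kern = K[i, :, :, :, j % t, k % t]
                Z[i, j, k] = np.sum(V[:, j : j + kh, k : k + kw] * kern)
    return Z
```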
To perform learning, one must be able to compute the gradient with respect to the kernel, given the gradient with respect to the outputs. In some simple cases, this operation can be performed using the convolution operation, but many cases of interest, including the case of stride greater than 1, do not have this property.

Recall that convolution is a linear operation and can thus be described as a matrix multiplication (if we first reshape the input tensor into a flat vector). The matrix involved is a function of the convolution kernel. The matrix is sparse and each element of the kernel is copied to several elements of the matrix. This view helps us to derive some of the other operations needed to implement a convolutional network.

Multiplication by the transpose of the matrix defined by convolution is one such operation.
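This matrix view can be made concrete in 1-D (a minimal sketch of our own; like the rest of this chapter, it uses convolution without kernel flipping, i.e. cross-correlation):

```python
import numpy as np

def conv_matrix(kernel, m):
    """Build the matrix G such that G @ v equals the 1-D 'valid'
    cross-correlation of a length-m vector v with the kernel.
    Each row is a shifted copy of the kernel, so every kernel
    element is copied to several elements of the matrix.
    """
    k = len(kernel)
    G = np.zeros((m - k + 1, m))
    for row in range(m - k + 1):
        G[row, row : row + k] = kernel
    return G

kernel = np.array([1.0, 2.0, 3.0])
v = np.random.randn(6)
G = conv_matrix(kernel, len(v))

# Forward pass as a matrix product:
z = G @ v
assert np.allclose(z, np.correlate(v, kernel, mode='valid'))

# Back-propagating dL/dz to dL/dv is multiplication by the transpose:
grad_z = np.random.randn(len(z))
grad_v = G.T @ grad_z   # same size as the input v
```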
This is the op operation eration needed to bac back-propagate k-propagate error deriv derivatives atives Multiplication b y the transp ose of the matrix defined by conv olution is one through a con conv volutional lay layer, er, so it is needed to train con conv volutional netw networks orks suc h op eration. This is the op eration needed to bac k-propagate error deriv atives that ha hav ve more than one hidden la lay yer. This same op operation eration is also needed if we through a convolutional layer,units so itfrom is needed to train con volutional netw orks wish to reconstruct the visible the hidden units (Simard et al. , 1992 ). that ha v e more than one hidden la y er. This same op eration is also needed if we Reconstructing the visible units is an op operation eration commonly used in the mo models dels wish to reconstruct the visible units from the hidden units ( Simard et al. , 1992 ). describ described ed in Part I I I of this b ook, such as auto autoenco enco encoders, ders, RBMs, and sparse co coding. ding. Reconstructing the visible units is an op eration commonly used in the mo dels Transp ranspose ose conv convolution olution is necessary to construct conv convolutional olutional versions of those describ in ePart I I of this b ook, op such as auto enco ders,gradient RBMs, and sparse can co ding. mo models. dels.edLik Like the Ikernel gradient operation, eration, this input op operation eration be T ransp ose conv olution is necessary to construct conv olutional versions of those implemen implemented ted using a conv convolution olution in some cases, but in the general case requires mo dels. Lik e the kernel gradient op eration, input eration canthis be a third op operation eration to b e implemented. Carethis must b e gradient tak taken en toopco coordinate ordinate implemen ted using a conv olution in some cases, but in the general case requires transp transpose ose op operation eration with the forw forward ard propagation. 
The size of the output that the transpose operation should return depends on the zero padding policy and stride of
CHAPTER 9. CONVOLUTIONAL NETWORKS
the forward propagation operation, as well as the size of the forward propagation's output map. In some cases, multiple sizes of input to forward propagation can result in the same size of output map, so the transpose operation must be explicitly told what the size of the original input was.

These three operations—convolution, backprop from output to weights, and backprop from output to inputs—are sufficient to compute all of the gradients needed to train any depth of feedforward convolutional network, as well as to train convolutional networks with reconstruction functions based on the transpose of convolution. See Goodfellow (2010) for a full derivation of the equations in the fully general multi-dimensional, multi-example case.
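The matrix view of convolution introduced above can be illustrated in one dimension. The sketch below (illustrative numpy code, not from the book) builds the sparse banded matrix corresponding to a 1-D kernel, checks that multiplying by it reproduces convolution, and shows that multiplying by its transpose maps an output-sized gradient back to the input's shape:

```python
import numpy as np

def conv_matrix(k, n):
    # Build the sparse (banded) matrix C such that C @ v equals the
    # "valid" cross-correlation of a length-n vector v with kernel k.
    # Each kernel element is copied into several entries of C.
    m = n - len(k) + 1
    C = np.zeros((m, n))
    for i in range(m):
        C[i, i:i + len(k)] = k
    return C

k = np.array([1.0, 2.0, -1.0])
v = np.arange(5.0)
C = conv_matrix(k, len(v))
assert np.allclose(C @ v, np.correlate(v, k, mode="valid"))

# Multiplication by C.T maps an output-sized gradient back to the
# input's shape -- the operation used to back-propagate through the layer.
g_out = np.ones(C.shape[0])
g_in = C.T @ g_out
assert g_in.shape == v.shape
```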
To give a sense of how these equations work, we present the two dimensional, single example version here.

Suppose we want to train a convolutional network that incorporates strided convolution of kernel stack K applied to multi-channel image V with stride s as defined by c(K, V, s) as in Eq. 9.8. Suppose we want to minimize some loss function J(V, K). During forward propagation, we will need to use c itself to output Z, which is then propagated through the rest of the network and used to compute the cost function J. During back-propagation, we will receive a tensor G such that

G_{i,j,k} = \frac{\partial}{\partial Z_{i,j,k}} J(V, K).

To train the network, we need to compute the derivatives with respect to the weights in the kernel. To do so, we can use a function

g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial K_{i,j,k,l}} J(V, K) = \sum_{m,n} G_{i,m,n} V_{j,(m-1)\times s+k,(n-1)\times s+l}.    (9.11)
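Eq. 9.11 can be implemented directly as nested loops. The following is a minimal numpy sketch (the function names c and g follow the text; the 0-indexed access V[j, m*s + k, n*s + l] corresponds to the 1-indexed V_{j,(m-1)×s+k,(n-1)×s+l} above), with a finite-difference check on one kernel entry:

```python
import numpy as np

def c(K, V, s):
    # Strided convolution (cross-correlation convention) as in Eq. 9.8.
    # K: kernel stack (out channels, in channels, kh, kw); V: image (in channels, H, W).
    out_ch, in_ch, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_ch, (H - kh) // s + 1, (W - kw) // s + 1))
    for i in range(Z.shape[0]):
        for m in range(Z.shape[1]):
            for n in range(Z.shape[2]):
                Z[i, m, n] = np.sum(K[i] * V[:, m * s:m * s + kh, n * s:n * s + kw])
    return Z

def g(G, V, s, K_shape):
    # Kernel gradient, Eq. 9.11 (0-indexed: V[j, m*s + k, n*s + l]).
    dK = np.zeros(K_shape)
    out_ch, oh, ow = G.shape
    in_ch, kh, kw = K_shape[1:]
    for i in range(out_ch):
        for j in range(in_ch):
            for k in range(kh):
                for l in range(kw):
                    for m in range(oh):
                        for n in range(ow):
                            dK[i, j, k, l] += G[i, m, n] * V[j, m * s + k, n * s + l]
    return dK

rng = np.random.default_rng(0)
V = rng.standard_normal((2, 6, 6))
K = rng.standard_normal((3, 2, 3, 3))
s = 2
Z = c(K, V, s)
G = np.ones_like(Z)                # gradient of the toy loss J = Z.sum()
dK = g(G, V, s, K.shape)

# Finite-difference check on one kernel entry.
eps = 1e-6
Kp = K.copy()
Kp[0, 0, 1, 1] += eps
assert abs((c(Kp, V, s).sum() - Z.sum()) / eps - dK[0, 0, 1, 1]) < 1e-4
```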
If this layer is not the bottom layer of the network, we will need to compute the gradient with respect to V in order to back-propagate the error farther down. To do so, we can use a function

h(K, G, s)_{i,j,k} = \frac{\partial}{\partial V_{i,j,k}} J(V, K)    (9.12)
    = \sum_{\substack{l,m \;\mathrm{s.t.}\; (l-1)\times s+m=j}} \; \sum_{\substack{n,p \;\mathrm{s.t.}\; (n-1)\times s+p=k}} \; \sum_q K_{q,i,m,p} G_{q,l,n}.    (9.13)

Autoencoder networks, described in Chapter 14, are feedforward networks trained to copy their input to their output. A simple example is the PCA algorithm, that copies its input x to an approximate reconstruction r using the function W^\top W x. It is common for more general autoencoders to use multiplication by the transpose of the weight matrix just as PCA does. To make such models
convolutional, we can use the function h to perform the transpose of the convolution operation. Suppose we have hidden units H in the same format as Z and we define a reconstruction

R = h(K, H, s).    (9.14)

In order to train the autoencoder, we will receive the gradient with respect to R as a tensor E. To train the decoder, we need to obtain the gradient with respect to K. This is given by g(H, E, s). To train the encoder, we need to obtain the gradient with respect to H. This is given by c(K, E, s). It is also possible to differentiate through g using c and h, but these operations are not needed for the back-propagation algorithm on any standard network architectures.

Generally, we do not use only a linear operation in order to transform from the inputs to the outputs in a convolutional layer. We generally also add some bias term to each output before applying the nonlinearity.
This raises the question of how to share parameters among the biases. For locally connected layers it is natural to give each unit its own bias, and for tiled convolution, it is natural to share the biases with the same tiling pattern as the kernels. For convolutional layers, it is typical to have one bias per channel of the output and share it across all locations within each convolution map. However, if the input is of known, fixed size, it is also possible to learn a separate bias at each location of the output map. Separating the biases may slightly reduce the statistical efficiency of the model, but also allows the model to correct for differences in the image statistics at different locations.
For example, when using implicit zero padding, detector units at the edge of the image receive less total input and may need larger biases.
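The bias-sharing options described above can be sketched as follows (illustrative numpy code; the shapes and the ReLU nonlinearity are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((16, 8, 8))   # pre-activations: (channels, rows, cols)

# Standard convolutional layer: one bias per output channel, broadcast
# across all spatial locations of that channel's feature map.
b_channel = rng.standard_normal((16, 1, 1))
A = np.maximum(0.0, Z + b_channel)    # ReLU nonlinearity as an example

# For inputs of known, fixed size: a separate bias at each location,
# e.g. letting edge units (which receive less total input under zero
# padding) learn larger biases.
b_location = rng.standard_normal((16, 8, 8))
A_fixed = np.maximum(0.0, Z + b_location)
assert A.shape == A_fixed.shape == (16, 8, 8)
```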
9.6
Structured Outputs
Convolutional networks can be used to output a high-dimensional, structured object, rather than just predicting a class label for a classification task or a real value for a regression task. Typically this object is just a tensor, emitted by a standard convolutional layer. For example, the model might emit a tensor S, where S_{i,j,k} is the probability that pixel (j, k) of the input to the network belongs to class i. This allows the model to label every pixel in an image and draw precise masks that follow the outlines of individual objects.

One issue that often comes up is that the output plane can be smaller than the input plane, as shown in Fig. 9.13. In the kinds of architectures typically used for classification of a single object in an image, the greatest reduction in the spatial dimensions of the network comes from using pooling layers with large stride. In
CHAPTER 9. CONVOLUTIONAL NETWORKS
[Figure 9.17 diagram: the input X feeds hidden representations H(1), H(2), H(3) through kernels U; each hidden representation produces a label estimate Ŷ(1), Ŷ(2), Ŷ(3) through kernels V; each estimate feeds back into the next hidden representation through kernels W.]
Figure 9.17: An example of a recurrent convolutional network for pixel labeling. The input is an image tensor X, with axes corresponding to image rows, image columns, and channels (red, green, blue). The goal is to output a tensor of labels Ŷ, with a probability distribution over labels for each pixel. This tensor has axes corresponding to image rows, image columns, and the different classes. Rather than outputting Ŷ in a single shot, the recurrent network iteratively refines its estimate Ŷ by using a previous estimate of Ŷ as input for creating a new estimate. The same parameters are used for each updated estimate, and the estimate can be refined as many times as we wish. The tensor of convolution kernels U is used on each step to compute the hidden representation given the input image.
The kernel tensor V is used to produce an estimate of the labels given the hidden values. On all but the first step, the kernels W are convolved over Ŷ to provide input to the hidden layer. On the first time step, this term is replaced by zero. Because the same parameters are used on each step, this is an example of a recurrent network, as described in Chapter 10.
order to produce an output map of similar size as the input, one can avoid pooling altogether (Jain et al., 2007). Another strategy is to simply emit a lower-resolution grid of labels (Pinheiro and Collobert, 2014, 2015). Finally, in principle, one could use a pooling operator with unit stride.

One strategy for pixel-wise labeling of images is to produce an initial guess of the image labels, then refine this initial guess using the interactions between neighboring pixels. Repeating this refinement step several times corresponds to using the same convolutions at each stage, sharing weights between the last layers of the deep net (Jain et al., 2007). This makes the sequence of computations performed by the successive convolutional layers with weights shared across layers a particular kind of recurrent network (Pinheiro and Collobert, 2014, 2015). Fig.
9.17 shows the architecture of such a recurrent convolutional network.

Once a prediction for each pixel is made, various methods can be used to further process these predictions in order to obtain a segmentation of the image into regions (Briggman et al., 2009; Turaga et al., 2010; Farabet et al., 2013).
The general idea is to assume that large groups of contiguous pixels tend to be associated with the same label. Graphical models can describe the probabilistic relationships between neighboring pixels. Alternatively, the convolutional network can be trained to maximize an approximation of the graphical model training objective (Ning et al., 2005; Thompson et al., 2014).
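The iterative refinement scheme of Fig. 9.17 can be sketched as follows (illustrative numpy/scipy code; the function names, layer sizes, and tanh nonlinearity are arbitrary choices, not from the book):

```python
import numpy as np
from scipy.signal import correlate

def conv_same(kernels, x):
    # Multi-channel "same" convolution: kernels (out_ch, in_ch, kh, kw),
    # x (in_ch, H, W) -> output (out_ch, H, W).
    return np.stack([
        sum(correlate(x[c], kernels[o, c], mode="same") for c in range(x.shape[0]))
        for o in range(kernels.shape[0])
    ])

def softmax(a):
    # Per-pixel probability distribution over classes (axis 0).
    e = np.exp(a - a.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def refine(X, U, V, W, steps=3):
    # U reads the image, W reads the previous label estimate (zero on the
    # first step), V maps hidden features to per-pixel class probabilities.
    # The same parameters are reused on every step, making this recurrent.
    Y = None
    for _ in range(steps):
        H = conv_same(U, X) + (conv_same(W, Y) if Y is not None else 0.0)
        Y = softmax(conv_same(V, np.tanh(H)))
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8)) * 0.1          # RGB input image
U = rng.standard_normal((4, 3, 3, 3)) * 0.1       # image -> hidden
W = rng.standard_normal((4, 5, 3, 3)) * 0.1       # previous labels -> hidden
V = rng.standard_normal((5, 4, 3, 3)) * 0.1       # hidden -> 5 classes
Y = refine(X, U, V, W)
assert Y.shape == (5, 8, 8)
assert np.allclose(Y.sum(axis=0), 1.0)
```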
9.7
Data Types
The data used with a convolutional network usually consists of several channels, each channel being the observation of a different quantity at some point in space or time. See Table 9.1 for examples of data types with different dimensionalities and number of channels.

For an example of convolutional networks applied to video, see Chen et al. (2010).

So far we have discussed only the case where every example in the train and test data has the same spatial dimensions. One advantage to convolutional networks is that they can also process inputs with varying spatial extents. These kinds of input simply cannot be represented by traditional, matrix multiplication-based neural networks. This provides a compelling reason to use convolutional networks even when computational cost and overfitting are not significant issues.

For example, consider a collection of images, where each image has a different width and height.
It is unclear how to model such inputs with a weight matrix of fixed size. Convolution is straightforward to apply; the kernel is simply applied a different number of times depending on the size of the input, and the output of the convolution operation scales accordingly. Convolution may be viewed as matrix multiplication; the same convolution kernel induces a different size of doubly block circulant matrix for each size of input. Sometimes the output of the network is allowed to have variable size as well as the input, for example if we want to assign a class label to each pixel of the input. In this case, no further design work is necessary.
In other cases, the network must produce some fixed-size output, for example if we want to assign a single class label to the entire image. In this case we must make some additional design steps, like inserting a pooling layer whose pooling regions scale in size proportional to the size of the input, in order to maintain a fixed number of pooled outputs. Some examples of this kind of strategy are shown in Fig. 9.11.

Note that the use of convolution for processing variable sized inputs only makes sense for inputs that have variable size because they contain varying amounts
Table 9.1: Examples of different formats of data that can be used with convolutional networks.

1-D, single channel. Audio waveform: The axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step.

1-D, multi-channel. Skeleton animation data: Animations of 3-D computer-rendered characters are generated by altering the pose of a "skeleton" over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character's skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint.

2-D, single channel. Audio data that has been preprocessed with a Fourier transform: We can transform the audio waveform into a 2D tensor with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation but at a different height in the network's output.

2-D, multi-channel. Color image data: One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and vertical axes of the image, conferring translation equivariance in both directions.

3-D, single channel. Volumetric data: A common source of this kind of data is medical imaging technology, such as CT scans.

3-D, multi-channel. Color video data: One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame.
of observation of the same kind of thing—different lengths of recordings over time, different widths of observations over space, etc. Convolution does not make sense if the input has variable size because it can optionally include different kinds of observations. For example, if we are processing college applications, and our features consist of both grades and standardized test scores, but not every applicant took the standardized test, then it does not make sense to convolve the same weights over both the features corresponding to the grades and the features corresponding to the test scores.
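The point that one kernel applies to inputs of any spatial extent, with the output size scaling accordingly, can be demonstrated directly (illustrative scipy code):

```python
import numpy as np
from scipy.signal import correlate2d

# One 3x3 kernel, images of several different sizes: the kernel is simply
# applied a different number of times, and the output scales accordingly.
k = np.random.default_rng(0).standard_normal((3, 3))
for shape in [(8, 8), (11, 17), (32, 5)]:
    img = np.random.default_rng(1).standard_normal(shape)
    out = correlate2d(img, k, mode="valid")
    assert out.shape == (shape[0] - 2, shape[1] - 2)
```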
9.8
Efficient Convolution Algorithms
Modern convolutional network applications often involve networks containing more than one million units. Powerful implementations exploiting parallel computation resources, as discussed in Sec. 12.1, are essential. However, in many cases it is also possible to speed up convolution by selecting an appropriate convolution algorithm.

Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.

When a d-dimensional kernel can be expressed as the outer product of d vectors, one vector per dimension, the kernel is called separable. When the kernel is separable, naive convolution is inefficient.
It is equivalent to compose d one-dimensional convolutions with each of these vectors. The composed approach is significantly faster than performing one d-dimensional convolution with their outer product. The kernel also takes fewer parameters to represent as vectors. If the kernel is w elements wide in each dimension, then naive multidimensional convolution requires O(w^d) runtime and parameter storage space, while separable convolution requires O(w × d) runtime and parameter storage space. Of course, not every convolution can be represented in this way.

Devising faster ways of performing convolution or approximate convolution without harming the accuracy of the model is an active area of research.
Even techniques that improve the efficiency of only forward propagation are useful because in the commercial setting, it is typical to devote more resources to deployment of a network than to its training.
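Both speedups described in this section can be verified numerically. The sketch below (illustrative numpy code, not from the book) checks that FFT-based convolution matches direct convolution, and that a separable 2-D kernel can be applied as two 1-D passes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
k = rng.standard_normal(8)

# FFT route: pointwise multiplication in the frequency domain, padded to
# the full output length so circular convolution matches linear convolution.
n = len(x) + len(k) - 1
fft_conv = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)
assert np.allclose(fft_conv, np.convolve(x, k, mode="full"))

# Separable route: a 2-D kernel that is an outer product of two vectors can
# be applied as two 1-D passes, O(w x d) work per output instead of O(w^d).
a, b = rng.standard_normal(5), rng.standard_normal(5)
K2 = np.outer(a, b)
img = rng.standard_normal((20, 20))
direct = np.array([[np.sum(img[i:i + 5, j:j + 5] * K2) for j in range(16)]
                   for i in range(16)])
# np.convolve flips its kernel, so flip the vectors to get correlation.
rows = np.array([np.convolve(r, b[::-1], mode="valid") for r in img])
sep = np.array([np.convolve(c, a[::-1], mode="valid") for c in rows.T]).T
assert np.allclose(direct, sep)
```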
CHAPTER 9. CONVOLUTIONAL NETWORKS
9.9 Random or Unsupervised Features

Typically, the most expensive part of convolutional network training is learning the features. The output layer is usually relatively inexpensive due to the small number of features provided as input to this layer after passing through several layers of pooling. When performing supervised training with gradient descent, every gradient step requires a complete run of forward propagation and backward propagation through the entire network. One way to reduce the cost of convolutional network training is to use features that are not trained in a supervised fashion.

There are three basic strategies for obtaining convolution kernels without supervised training. One is to simply initialize them randomly. Another is to design them by hand, for example by setting each kernel to detect edges at a certain orientation or scale.
Finally, one can learn the kernels with an unsupervised criterion. For example, Coates et al. (2011) apply k-means clustering to small image patches, then use each learned centroid as a convolution kernel. Part III describes many more unsupervised learning approaches. Learning the features with an unsupervised criterion allows them to be determined separately from the classifier layer at the top of the architecture. One can then extract the features for the entire training set just once, essentially constructing a new training set for the last layer. Learning the last layer is then typically a convex optimization problem, assuming the last layer is something like logistic regression or an SVM.

Random filters often work surprisingly well in convolutional networks (Jarrett et al., 2009; Saxe et al., 2011; Pinto et al., 2011; Cox and Pinto, 2011). Saxe et al. (2011) showed that layers consisting of convolution followed by pooling naturally become frequency selective and translation invariant when assigned random weights. They argue that this provides an inexpensive way to choose the architecture of a convolutional network: first evaluate the performance of several convolutional network architectures by training only the last layer, then take the best of these architectures and train the entire architecture using a more expensive approach.

An intermediate approach is to learn the features, but using methods that do not require full forward and back-propagation at every gradient step. As with multilayer perceptrons, we use greedy layer-wise pretraining, to train the first layer in isolation, then extract all features from the first layer only once, then train the second layer in isolation given those features, and so on. Chapter 8 has described how to perform supervised greedy layer-wise pretraining, and Part III extends this to greedy layer-wise pretraining using an unsupervised criterion at each layer. The canonical example of greedy layer-wise pretraining of a convolutional model is the convolutional deep belief network (Lee et al., 2009). Convolutional networks offer
us the opportunity to take the pretraining strategy one step further than is possible with multilayer perceptrons. Instead of training an entire convolutional layer at a time, we can train a model of a small patch, as Coates et al. (2011) do with k-means. We can then use the parameters from this patch-based model to define the kernels of a convolutional layer. This means that it is possible to use unsupervised learning to train a convolutional network without ever using convolution during the training process. Using this approach, we can train very large models and incur a high computational cost only at inference time (Ranzato et al., 2007b; Jarrett et al., 2009; Kavukcuoglu et al., 2010; Coates et al., 2013). This approach was popular from roughly 2007–2013, when labeled datasets were small and computational power was more limited. Today, most convolutional networks are trained in a purely supervised fashion, using full forward and back-propagation through the entire network on each training iteration.

As with other approaches to unsupervised pretraining, it remains difficult to tease apart the cause of some of the benefits seen with this approach. Unsupervised pretraining may offer some regularization relative to supervised training, or it may simply allow us to train much larger architectures due to the reduced computational cost of the learning rule.
9.10 The Neuroscientific Basis for Convolutional Networks

Convolutional networks are perhaps the greatest success story of biologically inspired artificial intelligence. Though convolutional networks have been guided by many other fields, some of the key design principles of neural networks were drawn from neuroscience.

The history of convolutional networks begins with neuroscientific experiments long before the relevant computational models were developed. Neurophysiologists David Hubel and Torsten Wiesel collaborated for several years to determine many of the most basic facts about how the mammalian vision system works (Hubel and Wiesel, 1959, 1962, 1968). Their accomplishments were eventually recognized with a Nobel prize. Their findings that have had the greatest influence on contemporary deep learning models were based on recording the activity of individual neurons in cats. They observed how neurons in the cat's brain responded to images projected in precise locations on a screen in front of the cat. Their great discovery was that neurons in the early visual system responded most strongly to very specific patterns of light, such as precisely oriented bars, but responded hardly at all to other patterns.
Their work helped to characterize many aspects of brain function that are beyond the scope of this book. From the point of view of deep learning, we can focus on a simplified, cartoon view of brain function.

In this simplified view, we focus on a part of the brain called V1, also known as the primary visual cortex. V1 is the first area of the brain that begins to perform significantly advanced processing of visual input. In this cartoon view, images are formed by light arriving in the eye and stimulating the retina, the light-sensitive tissue in the back of the eye. The neurons in the retina perform some simple preprocessing of the image but do not substantially alter the way it is represented. The image then passes through the optic nerve and a brain region called the lateral geniculate nucleus.
The main role, as far as we are concerned here, of both of these anatomical regions is primarily just to carry the signal from the eye to V1, which is located at the back of the head.

A convolutional network layer is designed to capture three properties of V1:

1. V1 is arranged in a spatial map. It actually has a two-dimensional structure mirroring the structure of the image in the retina. For example, light arriving at the lower half of the retina affects only the corresponding half of V1. Convolutional networks capture this property by having their features defined in terms of two-dimensional maps.

2. V1 contains many simple cells. A simple cell's activity can to some extent be characterized by a linear function of the image in a small, spatially localized receptive field. The detector units of a convolutional network are designed to emulate these properties of simple cells.

3. V1 also contains many complex cells. These cells respond to features that are similar to those detected by simple cells, but complex cells are invariant
to small shifts in the position of the feature. This inspires the pooling units of convolutional networks. Complex cells are also invariant to some changes in lighting that cannot be captured simply by pooling over spatial locations. These invariances have inspired some of the cross-channel pooling strategies in convolutional networks, such as maxout units (Goodfellow et al., 2013a).

Though we know the most about V1, it is generally believed that the same basic principles apply to other areas of the visual system. In our cartoon view of the visual system, the basic strategy of detection followed by pooling is repeatedly applied as we move deeper into the brain. As we pass through multiple anatomical layers of the brain, we eventually find cells that respond to some specific concept and are invariant to many transformations of the input. These cells have been
nicknamed "grandmother cells"—the idea is that a person could have a neuron that activates when seeing an image of their grandmother, regardless of whether she appears in the left or right side of the image, whether the image is a close-up of her face or a zoomed out shot of her entire body, whether she is brightly lit, or in shadow, etc.

These grandmother cells have been shown to actually exist in the human brain, in a region called the medial temporal lobe (Quiroga et al., 2005). Researchers tested whether individual neurons would respond to photos of famous individuals. They found what has come to be called the "Halle Berry neuron": an individual neuron that is activated by the concept of Halle Berry.
This neuron fires when a person sees a photo of Halle Berry, a drawing of Halle Berry, or even text containing the words "Halle Berry." Of course, this has nothing to do with Halle Berry herself; other neurons responded to the presence of Bill Clinton, Jennifer Aniston, etc.

These medial temporal lobe neurons are somewhat more general than modern convolutional networks, which would not automatically generalize to identifying a person or object when reading its name. The closest analog to a convolutional network's last layer of features is a brain area called the inferotemporal cortex (IT). When viewing an object, information flows from the retina, through the LGN, to V1, then onward to V2, then V4, then IT. This happens within the first 100ms of glimpsing an object. If a person is allowed to continue looking at the object for more time, then information will begin to flow backwards as the brain uses top-down feedback to update the activations in the lower level brain areas.
However, if we interrupt the person's gaze, and observe only the firing rates that result from the first 100ms of mostly feedforward activation, then IT proves to be very similar to a convolutional network. Convolutional networks can predict IT firing rates, and also perform very similarly to (time limited) humans on object recognition tasks (DiCarlo, 2013).

That being said, there are many differences between convolutional networks and the mammalian vision system. Some of these differences are well known to computational neuroscientists, but outside the scope of this book. Some of these differences are not yet known, because many basic questions about how the mammalian vision system works remain unanswered. As a brief list:

• The human eye is mostly very low resolution, except for a tiny patch called the fovea.
The fovea only observes an area about the size of a thumbnail held at arms length. Though we feel as if we can see an entire scene in high resolution, this is an illusion created by the subconscious part of our brain, as it stitches together several glimpses of small areas. Most convolutional networks actually receive large full resolution photographs as input. The human brain makes
several eye movements called saccades to glimpse the most visually salient or task-relevant parts of a scene. Incorporating similar attention mechanisms into deep learning models is an active research direction. In the context of deep learning, attention mechanisms have been most successful for natural language processing, as described in Sec. 12.4.5.1. Several visual models with foveation mechanisms have been developed but so far have not become the dominant approach (Larochelle and Hinton, 2010; Denil et al., 2012).

• The human visual system is integrated with many other senses, such as hearing, and factors like our moods and thoughts. Convolutional networks so far are purely visual.

• The human visual system does much more than just recognize objects. It is able to understand entire scenes including many objects and relationships
Itfor is b et etw ween ob objects, jects, and pro processes cesses rich 3-D geometric information ablebto understand entire scenes including many ob jects and relationships • our odies to interface with the w orld. Conv Convolutional olutional netw networks orks hav havee been b et w een ob jects, and pro cesses rich 3-D geometric information needed for. applied to some of these problems but these applications are in their infancy infancy. our b odies to interface with the world. Convolutional networks have been applied to some of areas theselike problems theseimpacted applications are in their Even en simple brain V1 arebut hea heavily vily by feedback frominfancy higher. • Ev lev levels. els. Feedbac eedback k has b een explored extensiv extensively ely in neural netw network ork mo models dels but Ev en simple brain areas like V1 are hea vily impacted b y feedback from higher has not yet b een sho shown wn to offer a compelling improv improvemen emen ement. t. lev els. F eedbac k has b een explored extensiv ely in neural netw ork mo dels but • • While has notfeedforw yet b een to offer compelling improv emen t. information as feedforward ardsho ITwn firing ratesa capture muc uch h of the same con conv volutional net netw work features, it is not clear how similar the intermediate While feedforw ard firing rates captureuses mucvery h of the same activ information as computations are. IT The brain probably different activation ation and con volutional network it neuron’s is not clear how similar the intermediate • p ooling functions. An features, individual activ activation ation probably is not wellcomputations are. The brain probably uses very different activ ation and characterized by a single linear filter resp response. onse. A recent mo model del of V1 in inv volv olves es p ooling functions. An individual neuron’s activ ation probably is not wellmultiple quadratic filters for eac each h neuron (Rust et al., 2005). 
Indeed our cartoon picture of "simple cells" and "complex cells" might create a nonexistent distinction; simple cells and complex cells might both be the same kind of cell but with their "parameters" enabling a continuum of behaviors ranging from what we call "simple" to what we call "complex."

It is also worth mentioning that neuroscience has told us relatively little about how to train convolutional networks. Model structures with parameter sharing across multiple spatial locations date back to early connectionist models of vision (Marr and Poggio, 1976), but these models did not use the modern back-propagation algorithm and gradient descent. For example, the Neocognitron (Fukushima, 1980) incorporated most of the model architecture design elements of the modern convolutional network but relied on a layer-wise unsupervised clustering algorithm.
Lang and Hinton (1988) introduced the use of back-propagation to train time-delay neural networks (TDNNs). To use contemporary terminology, TDNNs are one-dimensional convolutional networks applied to time series. Back-propagation applied to these models was not inspired by any neuroscientific observation and is considered by some to be biologically implausible. Following the success of back-propagation-based training of TDNNs, LeCun et al. (1989) developed the modern convolutional network by applying the same training algorithm to 2-D convolution applied to images.

So far we have described how simple cells are roughly linear and selective for certain features, complex cells are more nonlinear and become invariant to some transformations of these simple cell features, and stacks of layers that alternate between selectivity and invariance can yield grandmother cells for very specific phenomena. We have not yet described precisely what these individual cells detect.
b etaween selectivity inork, variance canb eyield grandmother cells for sp ecific In deep, nonlinearand netw network, it can difficult to understand thevery function of phenomena. W e ha v e not y et describ ed precisely what these individual cells detect. individual cells. Simple cells in the first lay layer er are easier to analyze, b ecause their In a deep, nonlinear netw ork, it can b e difficult to understand thework, function of resp responses onses are driven by a linear function. In an artificial neural net netw we can individual cells. Simple cells in the first lay er are easier to analyze, b ecause their just display an image of the conv convolution olution kernel to see what the corresp corresponding onding onsesofare driven by a linear neural netwnet ork, we can cresp hannel a con conv volutional la lay yer function. resp responds onds In to.an Inartificial a biological neural netw work, we just display an image of the conv olution k ernel to see what the corresp onding do not hav havee access to the weigh eights ts themselves. Instead, we put an electro electrode de in the cneuron hannelitself, of a con v olutional la y er resp onds to. In a biological neural net w ork, we displa display y sev several eral samples of white noise images in front of the animal’s do not hav e access to the w eigh ts themselves. Instead, w e put an electro de in W thee retina, and record how each of these samples causes the neuron to activ activate. ate. neuron sevdel eraltosamples of white noise images in front the animal’s can thenitself, fit a displa linearymo model these resp responses onses in order to obtain an of approximation retina, and record how these samples causes the neuron to activ ate. Whe of the neuron’s weigh weights. ts.each Thisofapproach is known as reverse corr orrelation elation (Ringac Ringach can then fit a, 2004 linear and Shapley ). mo del to these resp onses in order to obtain an approximation of the neuron’s weights. 
Reverse correlation shows us that most V1 cells have weights that are described by Gabor functions. The Gabor function describes the weight at a 2-D point in the image. We can think of an image as being a function of 2-D coordinates, I(x, y). Likewise, we can think of a simple cell as sampling the image at a set of locations, defined by a set of x coordinates X and a set of y coordinates Y, and applying weights that are also a function of the location, w(x, y). From this point of view, the response of a simple cell to an image is given by

s(I) = Σ_{x∈X} Σ_{y∈Y} w(x, y) I(x, y).    (9.15)
Specifically, w(x, y) takes the form of a Gabor function:

w(x, y; α, βx, βy, f, φ, x0, y0, τ) = α exp(−βx x′² − βy y′²) cos(f x′ + φ),    (9.16)

where

x′ = (x − x0) cos(τ) + (y − y0) sin(τ)    (9.17)
CHAPTER 9. CONVOLUTIONAL NETWORKS
and

y′ = −(x − x0) sin(τ) + (y − y0) cos(τ).    (9.18)

Here, α, βx, βy, f, φ, x0, y0, and τ are parameters that control the properties of the Gabor function. Fig. 9.18 shows some examples of Gabor functions with different settings of these parameters.

The parameters x0, y0, and τ define a coordinate system. We translate and rotate x and y to form x′ and y′. Specifically, the simple cell will respond to image features centered at the point (x0, y0), and it will respond to changes in brightness as we move along a line rotated τ radians from the horizontal.

Viewed as a function of x′ and y′, the function w then responds to changes in brightness as we move along the x′ axis. It has two important factors: one is a Gaussian function and the other is a cosine function.
The Gaussian factor α exp(−βx x′² − βy y′²) can be seen as a gating term that ensures the simple cell will only respond to values near where x′ and y′ are both zero, in other words, near the center of the cell's receptive field. The scaling factor α adjusts the total magnitude of the simple cell's response, while βx and βy control how quickly its receptive field falls off.

The cosine factor cos(f x′ + φ) controls how the simple cell responds to changing brightness along the x′ axis. The parameter f controls the frequency of the cosine and φ controls its phase offset.

Altogether, this cartoon view of simple cells means that a simple cell responds to a specific spatial frequency of brightness in a specific direction at a specific location. Simple cells are most excited when the wave of brightness in the image has the same phase as the weights. This occurs when the image is bright where the weights are positive and dark where the weights are negative.
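As a concrete illustration, the Gabor weighting of Eqs. 9.16–9.18 and the linear simple-cell response of Eq. 9.15 can be sketched in a few lines of NumPy. The function and variable names below are our own, chosen for clarity, not taken from any particular library:

```python
import numpy as np

def gabor_weights(alpha, beta_x, beta_y, f, phi, x0, y0, tau, coords):
    """Evaluate the Gabor function (Eqs. 9.16-9.18) on a grid of points.

    coords: array of shape (H, W, 2) holding the (x, y) coordinate of each pixel.
    """
    x = coords[..., 0]
    y = coords[..., 1]
    # Translate and rotate into the cell's coordinate system (Eqs. 9.17-9.18).
    x_p = (x - x0) * np.cos(tau) + (y - y0) * np.sin(tau)
    y_p = -(x - x0) * np.sin(tau) + (y - y0) * np.cos(tau)
    # Gaussian gating factor times the oscillating cosine factor (Eq. 9.16).
    return alpha * np.exp(-beta_x * x_p**2 - beta_y * y_p**2) * np.cos(f * x_p + phi)

def simple_cell_response(image, weights):
    """Linear simple-cell response s(I) = sum_xy w(x, y) I(x, y)  (Eq. 9.15)."""
    return np.sum(weights * image)
```

Evaluating `gabor_weights` on a coordinate grid (e.g. built with `np.meshgrid`) produces a weight image like those shown in Fig. 9.18; the response of the cell to an image is then just an elementwise product and sum.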
Simple cells are most inhibited when the wave of brightness is fully out of phase with the weights: when the image is dark where the weights are positive and bright where the weights are negative.

The cartoon view of a complex cell is that it computes the L² norm of the 2-D vector containing two simple cells' responses: c(I) = √(s0(I)² + s1(I)²). An important special case occurs when s1 has all of the same parameters as s0 except for φ, and φ is set such that s1 is one quarter cycle out of phase with s0. In this case, s0 and s1 form a quadrature pair.
A complex cell defined in this way responds when the Gaussian reweighted image I(x, y) exp(−βx x′² − βy y′²) contains a high amplitude sinusoidal wave with frequency f in direction τ near (x0, y0), regardless of the phase offset of this wave. In other words, the complex cell is invariant to small translations of the image in direction τ, or to negating the
Figure 9.18: Gabor functions with a variety of parameter settings. White indicates large positive weight, black indicates large negative weight, and the background gray corresponds to zero weight. (Left) Gabor functions with different values of the parameters that control the coordinate system: x0, y0, and τ. Each Gabor function in this grid is assigned a value of x0 and y0 proportional to its position in its grid, and τ is chosen so that each Gabor filter is sensitive to the direction radiating out from the center of the grid. For the other two plots, x0, y0, and τ are fixed to zero. (Center) Gabor functions with different Gaussian scale parameters βx and βy. Gabor functions are arranged in increasing width (decreasing βx) as we move left to right through the grid, and increasing height (decreasing βy) as we move top to bottom. For the other two plots, the β values are fixed to 1.5× the image width. (Right) Gabor functions with different sinusoid parameters f and φ. As we move top to bottom, f increases, and as we move left to right, φ increases. For the other two plots, φ is fixed to 0 and f is fixed to 5× the image width.

image (replacing black with white and vice versa).
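The quadrature-pair energy computation described above can be sketched as follows. This is a minimal illustration that uses a pure sinusoidal pair without the Gaussian envelope, a simplification of our own, and all names are hypothetical:

```python
import numpy as np

def complex_cell_response(image, w0, w1):
    """Cartoon complex cell: the L2 norm of two simple-cell responses,
    c(I) = sqrt(s0(I)^2 + s1(I)^2)."""
    s0 = np.sum(w0 * image)  # linear simple-cell response, as in Eq. 9.15
    s1 = np.sum(w1 * image)
    return np.hypot(s0, s1)

# A quadrature pair: identical parameters except the phase offset phi,
# shifted by one quarter cycle (pi/2), so the cosine becomes a sine.
n = 64
x = 2 * np.pi * np.arange(n) / n
w0 = np.cos(3 * x)
w1 = np.sin(3 * x)

# The response to a grating of the matching frequency is the same for any
# phase of the input: the complex cell is phase invariant.
responses = [complex_cell_response(np.cos(3 * x + phase), w0, w1)
             for phase in (0.0, 0.7, 1.9)]
```

With the Gaussian factor of Eq. 9.16 included, the pair would in addition be localized around (x0, y0), as described in the text.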
Some of the most striking correspondences between neuroscience and machine learning come from visually comparing the features learned by machine learning models with those employed by V1. Olshausen and Field (1996) showed that a simple unsupervised learning algorithm, sparse coding, learns features with receptive fields similar to those of simple cells. Since then, we have found that an extremely wide variety of statistical learning algorithms learn features with Gabor-like functions when applied to natural images. This includes most deep learning algorithms, which learn these features in their first layer. Fig. 9.19 shows some examples. Because so many different learning algorithms learn edge detectors, it is difficult to conclude that any specific learning algorithm is the “right” model of the
brain just based on the features that it learns (though it can certainly be a bad sign if an algorithm does not learn some sort of edge detector when applied to natural images). These features are an important part of the statistical structure of natural images and can be recovered by many different approaches to statistical modeling. See Hyvärinen et al. (2009) for a review of the field of natural image statistics.
Figure 9.19: Many machine learning algorithms learn features that detect edges or specific colors of edges when applied to natural images. These feature detectors are reminiscent of the Gabor functions known to be present in primary visual cortex. (Left) Weights learned by an unsupervised learning algorithm (spike and slab sparse coding) applied to small image patches. (Right) Convolution kernels learned by the first layer of a fully supervised convolutional maxout network. Neighboring pairs of filters drive the same maxout unit.
9.11  Convolutional Networks and the History of Deep Learning

Convolutional networks have played an important role in the history of deep learning. They are a key example of a successful application of insights obtained by studying the brain to machine learning applications. They were also some of the first deep models to perform well, long before arbitrary deep models were considered viable. Convolutional networks were also some of the first neural networks to solve important commercial applications and remain at the forefront of commercial applications of deep learning today. For example, in the 1990s, the neural network research group at AT&T developed a convolutional network for reading checks (LeCun et al., 1998b). By the end of the 1990s, this system deployed by NEC was reading over 10% of all the checks in the US.
Later, several OCR and handwriting recognition systems based on convolutional nets were deployed by Microsoft (Simard et al., 2003). See Chapter 12 for more details on such applications and more modern applications of convolutional networks. See LeCun et al. (2010) for a more in-depth history of convolutional networks up to 2010.

Convolutional networks were also used to win many contests. The current intensity of commercial interest in deep learning began when Krizhevsky et al. (2012) won the ImageNet object recognition challenge, but convolutional networks
had been used to win other machine learning and computer vision contests with less impact for years earlier.

Convolutional nets were some of the first working deep networks trained with back-propagation. It is not entirely clear why convolutional networks succeeded when general back-propagation networks were considered to have failed. It may simply be that convolutional networks were more computationally efficient than fully connected networks, so it was easier to run multiple experiments with them and tune their implementation and hyperparameters. Larger networks also seem to be easier to train. With modern hardware, large fully connected networks appear to perform reasonably on many tasks, even when using datasets that were available and activation functions that were popular during the times when fully connected networks were believed not to work well. It may be that the primary barriers to the success of neural networks were psychological (practitioners did not expect neural networks to work, so they did not make a serious effort to use neural networks). Whatever the case, it is fortunate that convolutional networks performed well decades ago. In many ways, they carried the torch for the rest of deep learning and paved the way to the acceptance of neural networks in general.

Convolutional networks provide a way to specialize neural networks to work with data that has a clear grid-structured topology and to scale such models to very large size. This approach has been the most successful on a two-dimensional, image topology. To process one-dimensional, sequential data, we turn next to another powerful specialization of the neural networks framework: recurrent neural networks.
Chapter 10
Sequence Modeling: Recurrent and Recursive Nets

Recurrent neural networks, or RNNs (Rumelhart et al., 1986a), are a family of neural networks for processing sequential data. Much as a convolutional network is a neural network that is specialized for processing a grid of values X such as an image, a recurrent neural network is a neural network that is specialized for processing a sequence of values x(1), . . . , x(τ). Just as convolutional networks can readily scale to images with large width and height, and some convolutional networks can process images of variable size, recurrent networks can scale to much longer sequences than would be practical for networks without sequence-based specialization. Most recurrent networks can also process sequences of variable length.
To go from multi-layer networks to recurrent networks, we need to take advantage of one of the early ideas found in machine learning and statistical models of the 1980s: sharing parameters across different parts of a model. Parameter sharing makes it possible to extend and apply the model to examples of different forms (different lengths, here) and generalize across them. If we had separate parameters for each value of the time index, we could not generalize to sequence lengths not seen during training, nor share statistical strength across different sequence lengths and across different positions in time. Such sharing is particularly important when a specific piece of information can occur at multiple positions within the sequence. For example, consider the two sentences “I went to Nepal in 2009” and “In 2009,
I went to Nepal.” If we ask a machine learning model to read each sentence and extract the year in which the narrator went to Nepal, we would like it to recognize the year 2009 as the relevant piece of information, whether it appears in the sixth
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
word or the second word of the sentence. Suppose that we trained a feedforward network that processes sentences of fixed length. A traditional fully connected feedforward network would have separate parameters for each input feature, so it would need to learn all of the rules of the language separately at each position in the sentence. By comparison, a recurrent neural network shares the same weights across several time steps.

A related idea is the use of convolution across a 1-D temporal sequence. This convolutional approach is the basis for time-delay neural networks (Lang and Hinton, 1988; Waibel et al., 1989; Lang et al., 1990). The convolution operation allows a network to share parameters across time, but is shallow. The output of convolution is a sequence where each member of the output is a function of a small number of neighboring members of the input.
The idea of parameter sharing manifests in the application of the same convolution kernel at each time step. Recurrent networks share parameters in a different way. Each member of the output is a function of the previous members of the output. Each member of the output is produced using the same update rule applied to the previous outputs. This recurrent formulation results in the sharing of parameters through a very deep computational graph.

For the simplicity of exposition, we refer to RNNs as operating on a sequence that contains vectors x(t) with the time step index t ranging from 1 to τ. In practice, recurrent networks usually operate on minibatches of such sequences, with a different sequence length τ for each member of the minibatch. We have
Moreov Moreover, er, the time step index τ with a different sequence length for eac h mem b er of the minibatc Wetohathe ve need not literally refer to the passage of time in the real world, but h. only omitted the minibatch indices to simplify notation. Moreov er, the time step index position in the sequence. RNNs ma may y also be applied in tw two o dimensions across need not literally refer to theand passage timeapplied in the to real world, but only to the the spatial data such as images, even of when data inv involving olving time, p osition in ythe sequence. RNNs also be applied in provided two dimensions net netwo wo work rk ma may ha have ve connections thatma goy bac backwards kwards in time, that the across entire spatial data such as images, and even when applied to data inv olving time, the sequence is observed before it is provided to the netw network. ork. network may have connections that go backwards in time, provided that the entire This chapter extends the idea of a computational graph to include cycles. These sequence is observed before it is provided to the network. cycles represent the influence of the presen presentt value of a variable on its own value This c hapter extends the idea of a computational cycles. Theset at a future time step. Suc Such h computational graphsgraph allo allow wtousinclude to define recurren recurrent cycles represent of the presen t valuet of a variable on its own neural netw networks. orks.the Weinfluence then describ describe e many differen different ways to construct, train,value and at a future time step. Suc h computational graphs allo w us to define recurren t use recurrent neural netw networks. orks. neural networks. We then describe many different ways to construct, train, and For more information on recurrent neural net netw works than is available in this use recurrent neural networks. chapter, we refer the reader to the textb textbo ook of Grav Graves es (2012). 
For more information on recurrent neural networks than is available in this chapter, we refer the reader to the textbook of Graves (2012).
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
10.1 Unfolding Computational Graphs
A computational graph is a way to formalize the structure of a set of computations, such as those involved in mapping inputs and parameters to outputs and loss. Please refer to Sec. 6.5.1 for a general introduction. In this section we explain the idea of unfolding a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events. Unfolding this graph results in the sharing of parameters across a deep network structure.

For example, consider the classical form of a dynamical system:

    s^(t) = f(s^(t−1); θ),    (10.1)

where s^(t) is called the state of the system.

Eq. 10.1 is recurrent because the definition of s at time t refers back to the same definition at time t − 1.

For a finite number of time steps τ, the graph can be unfolded by applying the definition τ − 1 times. For example, if we unfold Eq. 10.1 for τ = 3 time steps, we obtain

    s^(3) = f(s^(2); θ)           (10.2)
          = f(f(s^(1); θ); θ).    (10.3)

Unfolding the equation by repeatedly applying the definition in this way has yielded an expression that does not involve recurrence. Such an expression can now be represented by a traditional directed acyclic computational graph. The unfolded computational graph of Eq. 10.1 and Eq. 10.3 is illustrated in Fig. 10.1.
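As a small illustration of this unfolding, consider a toy scalar system. The chapter leaves f abstract, so the concrete transition function below is an arbitrary choice made purely for the example:

```python
import numpy as np

# Toy transition function; f is abstract in the text, so we pick an
# arbitrary concrete choice for illustration: f(s; theta) = tanh(theta * s).
def f(s, theta):
    return np.tanh(theta * s)

def unfold(s1, theta, steps):
    """Repeatedly apply the recurrence s^(t) = f(s^(t-1); theta) of Eq. 10.1."""
    s = s1
    for _ in range(steps):
        s = f(s, theta)
    return s

s1, theta = 2.0, 0.5
# Unfolding for tau = 3 applies f twice, giving the non-recurrent
# nested expression of Eq. 10.3: s^(3) = f(f(s^(1); theta); theta).
s3 = unfold(s1, theta, steps=2)
assert np.isclose(s3, f(f(s1, theta), theta))
```

The loop and the nested expression compute the same value; unfolding simply makes every intermediate state an explicit node of an acyclic graph.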
s^(...) → f → s^(t−1) → f → s^(t) → f → s^(t+1) → f → s^(...)
Figure 10.1: The classical dynamical system described by Eq. 10.1, illustrated as an unfolded computational graph. Each node represents the state at some time t and the function f maps the state at t to the state at t + 1. The same parameters (the same value of θ used to parametrize f) are used for all time steps.

As another example, let us consider a dynamical system driven by an external signal x^(t),

    s^(t) = f(s^(t−1), x^(t); θ),    (10.4)
where we see that the state now contains information about the whole past sequence.

Recurrent neural networks can be built in many different ways. Much as almost any function can be considered a feedforward neural network, essentially any function involving recurrence can be considered a recurrent neural network.

Many recurrent neural networks use Eq. 10.5 or a similar equation to define the values of their hidden units. To indicate that the state is the hidden units of the network, we now rewrite Eq. 10.4 using the variable h to represent the state:

    h^(t) = f(h^(t−1), x^(t); θ),    (10.5)

illustrated in Fig. 10.2. Typical RNNs will add extra architectural features such as output layers that read information out of the state h to make predictions.

When the recurrent network is trained to perform a task that requires predicting the future from the past, the network typically learns to use h^(t) as a kind of lossy summary of the task-relevant aspects of the past sequence of inputs up to t. This summary is in general necessarily lossy, since it maps an arbitrary length sequence (x^(t), x^(t−1), x^(t−2), . . . , x^(2), x^(1)) to a fixed length vector h^(t). Depending on the training criterion, this summary might selectively keep some aspects of the past sequence with more precision than other aspects. For example, if the RNN is used in statistical language modeling, typically to predict the next word given previous words, it may not be necessary to store all of the information in the input sequence up to time t, but rather only enough information to predict the rest of the sentence. The most demanding situation is when we ask h^(t) to be rich enough to allow one to approximately recover the input sequence, as in autoencoder frameworks (Chapter 14).
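The mapping from a variable-length sequence to a fixed-size state can be sketched directly from Eq. 10.5. The tanh transition and all dimensions below are illustrative choices, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5                                  # illustrative sizes
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))   # hidden-to-hidden
U = rng.normal(scale=0.1, size=(n_hidden, n_in))       # input-to-hidden

def state_update(h_prev, x):
    """One application of Eq. 10.5, h^(t) = f(h^(t-1), x^(t); theta),
    with f chosen here as a tanh transition."""
    return np.tanh(W @ h_prev + U @ x)

def summarize(xs):
    """Fold an arbitrary-length sequence into the fixed-size final state."""
    h = np.zeros(n_hidden)                             # h^(0)
    for x in xs:
        h = state_update(h, x)
    return h

# Sequences of different lengths yield summaries of the same fixed size,
# which is exactly why the summary must in general be lossy.
h_short = summarize(rng.normal(size=(4, n_in)))
h_long = summarize(rng.normal(size=(50, n_in)))
assert h_short.shape == h_long.shape == (n_hidden,)
```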
Figure 10.2: A recurrent network with no outputs. This recurrent network just processes information from the input x by incorporating it into the state h that is passed forward through time. (Left) Circuit diagram. The black square indicates a delay of 1 time step. (Right) The same network seen as an unfolded computational graph, where each node is now associated with one particular time instance.

Eq. 10.5 can be drawn in two different ways. One way to draw the RNN is with a diagram containing one node for every component that might exist in a
physical implementation of the model, such as a biological neural network. In this view, the network defines a circuit that operates in real time, with physical parts whose current state can influence their future state, as in the left of Fig. 10.2. Throughout this chapter, we use a black square in a circuit diagram to indicate that an interaction takes place with a delay of 1 time step, from the state at time t to the state at time t + 1. The other way to draw the RNN is as an unfolded computational graph, in which each component is represented by many different variables, with one variable per time step, representing the state of the component at that point in time. Each variable for each time step is drawn as a separate node of the computational graph, as in the right of Fig. 10.2. What we call unfolding is the operation that maps a circuit as in the left side of the figure to a computational graph with repeated pieces as in the right side. The unfolded graph now has a size that depends on the sequence length.

We can represent the unfolded recurrence after t steps with a function g^(t):

    h^(t) = g^(t)(x^(t), x^(t−1), x^(t−2), . . . , x^(2), x^(1))    (10.6)
          = f(h^(t−1), x^(t); θ).    (10.7)

The function g^(t) takes the whole past sequence (x^(t), x^(t−1), x^(t−2), . . . , x^(2), x^(1)) as input and produces the current state, but the unfolded recurrent structure allows us to factorize g^(t) into repeated application of a function f. The unfolding process thus introduces two major advantages:

1. Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of transition from one state to another state, rather than specified in terms of a variable-length history of states.

2. It is possible to use the transition function f with the same parameters at every time step.

These two factors make it possible to learn a single model f that operates on all time steps and all sequence lengths, rather than needing to learn a separate model g^(t) for all possible time steps. Learning a single, shared model allows generalization to sequence lengths that did not appear in the training set, and allows the model to be estimated with far fewer training examples than would be required without parameter sharing.

Both the recurrent graph and the unrolled graph have their uses. The recurrent graph is succinct. The unfolded graph provides an explicit description of which computations to perform. The unfolded graph also helps to illustrate the idea of
information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.
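This forward-then-backward flow on the unrolled graph can be made concrete with a minimal sketch of back-propagation through time. The model below is a tiny invented example (tanh transition, linear readout, squared-error loss); the chapter's full model and loss appear in Sec. 10.2:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, tau = 2, 3, 4                      # illustrative sizes
U = rng.normal(scale=0.5, size=(n_h, n_in))   # input-to-hidden
W = rng.normal(scale=0.5, size=(n_h, n_h))    # hidden-to-hidden
v = rng.normal(size=n_h)                      # linear readout
xs = rng.normal(size=(tau, n_in))
ys = rng.normal(size=tau)

def forward(W):
    """Left-to-right pass: compute and store every state h_t, then the loss."""
    hs = [np.zeros(n_h)]                      # h_0
    for x in xs:
        hs.append(np.tanh(W @ hs[-1] + U @ x))
    loss = 0.5 * sum((v @ h - y) ** 2 for h, y in zip(hs[1:], ys))
    return hs, loss

def backward(W, hs):
    """Right-to-left pass: gradients flow backward in time through the
    stored states, so the memory cost is O(tau)."""
    dW = np.zeros_like(W)
    g_h = np.zeros(n_h)                       # gradient w.r.t. h_t from the future
    for t in range(tau, 0, -1):
        g_h = g_h + (v @ hs[t] - ys[t - 1]) * v   # direct loss term at step t
        g_a = g_h * (1.0 - hs[t] ** 2)            # back through tanh
        dW += np.outer(g_a, hs[t - 1])            # same W at every step
        g_h = W.T @ g_a                           # hand gradient to step t-1
    return dW

hs, loss = forward(W)
dW = backward(W, hs)
```

A finite-difference check on any single entry of W confirms that the backward pass matches the gradient of the forward-pass loss.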
10.2 Recurrent Neural Networks
Armed with the graph unrolling and parameter sharing ideas of Sec. 10.1, we can design a wide variety of recurrent neural networks.
Figure 10.3: The computational graph to compute the training loss of a recurrent network that maps an input sequence of x values to a corresponding sequence of output o values. A loss L measures how far each o is from the corresponding training target y. When using softmax outputs, we assume o is the unnormalized log probabilities. The loss L internally computes ŷ = softmax(o) and compares this to the target y. The RNN has input-to-hidden connections parametrized by a weight matrix U, hidden-to-hidden recurrent connections parametrized by a weight matrix W, and hidden-to-output connections parametrized by a weight matrix V. Eq. 10.8 defines forward propagation in this model. (Left) The RNN and its loss drawn with recurrent connections. (Right) The same seen as a time-unfolded computational graph, where each node is now associated with one particular time instance.

Some examples of important design patterns for recurrent neural networks include the following:

• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in Fig. 10.3.

• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in Fig. 10.4.

• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in Fig. 10.5.

Fig. 10.3 is a reasonably representative example that we return to throughout most of the chapter.

The recurrent neural network of Fig. 10.3 and Eq. 10.8 is universal in the sense that any function computable by a Turing machine can be computed by such a recurrent network of a finite size.
The output can be read from the RNN after a number of time steps that is asymptotically linear in the number of time steps used by the Turing machine and asymptotically linear in the length of the input (Siegelmann and Sontag, 1991; Siegelmann, 1995; Siegelmann and Sontag, 1995; Hyotyniemi, 1996). The functions computable by a Turing machine are discrete, so these results regard exact implementation of the function, not approximations. The RNN, when used as a Turing machine, takes a binary sequence as input and its outputs must be discretized to provide a binary output. It is possible to compute all functions in this setting using a single specific RNN of finite size (Siegelmann and Sontag (1995) use 886 units). The "input" of the Turing machine is a specification of the function to be computed, so the same network that simulates this Turing machine is sufficient for all problems. The theoretical RNN used for the proof can simulate an unbounded stack by representing its activations and weights with rational numbers of unbounded precision.

We now develop the forward propagation equations for the RNN depicted in Fig. 10.3. The figure does not specify the choice of activation function for the hidden units. Here we assume the hyperbolic tangent activation function. Also, the figure does not specify exactly what form the output and loss function take. Here we assume that the output is discrete, as if the RNN is used to predict words or characters. A natural way to represent discrete variables is to regard the output o as giving the unnormalized log probabilities of each possible value of the discrete variable. We can then apply the softmax operation as a post-processing step to obtain a vector ŷ of normalized probabilities over the output. Forward propagation begins with a specification of the initial state h^(0). Then, for each time step from t = 1 to t = τ, we apply the following update equations:

    a^(t) = b + W h^(t−1) + U x^(t)    (10.8)
Figure 10.4: An RNN whose only recurrence is the feedback connection from the output to the hidden layer. At each time step t, the input is x^(t), the hidden layer activations are h^(t), the outputs are o^(t), the targets are y^(t) and the loss is L^(t). (Left) Circuit diagram. (Right) Unfolded computational graph. Such an RNN is less powerful (can express a smaller set of functions) than those in the family represented by Fig. 10.3. The RNN in Fig. 10.3 can choose to put any information it wants about the past into its hidden representation h and transmit h to the future. The RNN in this figure is trained to put a specific output value into o, and o is the only information it is allowed to send to the future. There are no direct connections from h going forward. The previous h is connected to the present only indirectly, via the predictions it was used to produce. Unless o is very high-dimensional and rich, it will usually lack important information from the past. This makes the RNN in this figure less powerful, but it may be easier to train because each time step can be trained in isolation from the others, allowing greater parallelization during training, as described in Sec. 10.2.1.
    h^(t) = tanh(a^(t))        (10.9)
    o^(t) = c + V h^(t)        (10.10)
    ŷ^(t) = softmax(o^(t))     (10.11)

where the parameters are the bias vectors b and c along with the weight matrices U, V and W, respectively for input-to-hidden, hidden-to-output and hidden-to-hidden connections. This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps. For example, if L^(t) is the negative log-likelihood of y^(t) given x^(1), . . . , x^(t), then

    L({x^(1), . . . , x^(τ)}, {y^(1), . . . , y^(τ)})                 (10.12)
        = Σ_t L^(t)                                                   (10.13)
        = − Σ_t log p_model( y^(t) | {x^(1), . . . , x^(t)} ),        (10.14)

where p_model( y^(t) | {x^(1), . . . , x^(t)} ) is given by reading the entry for y^(t) from the model's output vector ŷ^(t). Computing the gradient of this loss function with respect to the parameters is an expensive operation. The gradient computation involves performing a forward propagation pass moving left to right through our illustration of the unrolled graph in Fig. 10.3, followed by a backward propagation pass moving right to left through the graph. The runtime is O(τ) and cannot be reduced by parallelization because the forward propagation graph is inherently sequential; each time step may only be computed after the previous one. States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(τ). The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-propagation through time or BPTT and is discussed further in Sec. 10.2.2. The network with recurrence between hidden units is thus very powerful but also expensive to train. Is there an alternative?

The network with recurrent connections only from the output at one time step to the hidden units at the next time step (shown in Fig. 10.4) is strictly less powerful because it lacks hidden-to-hidden recurrent connections. For example, it cannot simulate a universal Turing machine. Because this network lacks hidden-to-hidden
recurrence, it requires that the output units capture all of the information about the past that the network will use to predict the future. Because the output units are explicitly trained to match the training set targets, they are unlikely to capture the necessary information about the past history of the input, unless the user knows how to describe the full state of the system and provides it as part of the training set targets. The advantage of eliminating hidden-to-hidden recurrence is that, for any loss function based on comparing the prediction at time t to the training target at time t, all the time steps are decoupled. Training can thus be parallelized, with the gradient for each step t computed in isolation. There is no need to compute the output for the previous time step first, because the training set provides the ideal value of that output.
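Putting Eqs. 10.8–10.14 together, the forward pass and loss for the hidden-to-hidden architecture of Fig. 10.3 can be sketched as follows. The dimensions and parameter values are illustrative, not from the text:

```python
import numpy as np

def softmax(o):
    # Numerically stable softmax over the unnormalized log probabilities o.
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs, ys, U, V, W, b, c, h0):
    """Forward propagation (Eqs. 10.8-10.11) and total negative
    log-likelihood loss (Eqs. 10.12-10.14) for one sequence.
    xs: (tau, n_in) inputs; ys: (tau,) integer targets."""
    h = h0
    loss = 0.0
    for x, y in zip(xs, ys):
        a = b + W @ h + U @ x          # Eq. 10.8
        h = np.tanh(a)                 # Eq. 10.9
        o = c + V @ h                  # Eq. 10.10
        yhat = softmax(o)              # Eq. 10.11
        loss -= np.log(yhat[y])        # Eq. 10.14, summed over t
    return loss

# Illustrative sizes and random parameters.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 4, 6, 3, 5
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = rng.normal(size=(tau, n_in))
ys = rng.integers(0, n_out, size=tau)
loss = forward(xs, ys, U, V, W, b, c, np.zeros(n_hidden))
```

With zero biases and small random weights, each ŷ^(t) is roughly uniform, so the total loss comes out near τ log(n_out).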
Figure 10.5: Time-unfolded recurrent neural network with a single output at the end of the sequence. Such a network can be used to summarize a sequence and produce a fixed-size representation used as input for further processing. There might be a target right at the end (as depicted here), or the gradient on the output o can be obtained by back-propagating from further downstream modules.

Models that have recurrent connections from their outputs leading back into the model may be trained with teacher forcing. Teacher forcing is a procedure that emerges from the maximum likelihood criterion, in which during training the model receives the ground truth output y^(t) as input at time t + 1. We can see this by examining a sequence with two time steps. The conditional maximum likelihood criterion is

\log p\left(y^{(1)}, y^{(2)} \mid x^{(1)}, x^{(2)}\right)    (10.15)
Figure 10.6: Illustration of teacher forcing. Teacher forcing is a training technique that is applicable to RNNs that have connections from their output to their hidden states at the next time step. (Left) At train time, we feed the correct output y^(t) drawn from the train set as input to h^(t+1). (Right) When the model is deployed, the true output is generally not known. In this case, we approximate the correct output y^(t) with the model's output o^(t), and feed the output back into the model.
= \log p\left(y^{(2)} \mid y^{(1)}, x^{(1)}, x^{(2)}\right) + \log p\left(y^{(1)} \mid x^{(1)}, x^{(2)}\right)    (10.16)
In this example, we see that at time t = 2, the model is trained to maximize the conditional probability of y^(2) given both the x sequence so far and the previous y value from the training set. Maximum likelihood thus specifies that during training, rather than feeding the model's own output back into itself, these connections should be fed with the target values specifying what the correct output should be. This is illustrated in Fig. 10.6.

We originally motivated teacher forcing as allowing us to avoid back-propagation through time in models that lack hidden-to-hidden connections. Teacher forcing may still be applied to models that have hidden-to-hidden connections, so long as they have connections from the output at one time step to values computed in the next time step. However, as soon as the hidden units become a function of earlier time steps, the BPTT algorithm is necessary. Some models may thus be trained with both teacher forcing and BPTT.

The disadvantage of strict teacher forcing arises if the network is going to be later used in an open-loop mode, with the network outputs (or samples from the output distribution) fed back as input. In this case, the kind of inputs that the network sees during training could be quite different from the kind of inputs that it will see at test time. One way to mitigate this problem is to train with both teacher-forced inputs and free-running inputs, for example by predicting the correct target a number of steps in the future through the unfolded recurrent output-to-input paths. In this way, the network can learn to take into account input conditions (such as those it generates itself in the free-running mode) not seen during training, and how to map the state back towards one that will make the network generate proper outputs after a few steps. Another approach (Bengio et al., 2015b) to mitigate the gap between the inputs seen at train time and the inputs seen at test time randomly chooses to use generated values or actual data values as input. This approach exploits a curriculum learning strategy to gradually use more of the generated values as input.
Computing the gradient through a recurrent neural network is straightforward. One simply applies the generalized back-propagation algorithm of Sec. 6.5.6 to the unrolled computational graph. No specialized algorithms are necessary. The use of back-propagation on the unrolled graph is called the back-propagation through time (BPTT) algorithm. Gradients obtained by back-propagation may then be used with any general-purpose gradient-based techniques to train an RNN.
To gain some intuition for how the BPTT algorithm behaves, we provide an example of how to compute gradients by BPTT for the RNN equations above (Eq. 10.8 and Eq. 10.12). The nodes of our computational graph include the parameters U, V, W, b and c as well as the sequence of nodes indexed by t for x^(t), h^(t), o^(t) and L^(t). For each node N we need to compute the gradient \nabla_{\mathrm{N}} L recursively, based on the gradient computed at nodes that follow it in the graph. We start the recursion with the nodes immediately preceding the final loss:

\frac{\partial L}{\partial L^{(t)}} = 1.    (10.17)

In this derivation we assume that the outputs o^(t) are used as the argument to the softmax function to obtain the vector \hat{y} of probabilities over the output. We also assume that the loss is the negative log-likelihood of the true target y^(t) given the input so far. The gradient \nabla_{o^{(t)}} L on the outputs at time step t, for all i, t, is as follows:

(\nabla_{o^{(t)}} L)_i = \frac{\partial L}{\partial o_i^{(t)}} = \frac{\partial L}{\partial L^{(t)}} \frac{\partial L^{(t)}}{\partial o_i^{(t)}} = \hat{y}_i^{(t)} - \mathbf{1}_{i,y^{(t)}}.    (10.18)
We work our way backwards, starting from the end of the sequence. At the final time step τ, h^(τ) only has o^(τ) as a descendent, so its gradient is simple:

\nabla_{h^{(\tau)}} L = V^\top \nabla_{o^{(\tau)}} L.    (10.19)

We can then iterate backwards in time to back-propagate gradients through time, from t = τ − 1 down to t = 1, noting that h^(t) (for t < τ) has as descendents both o^(t) and h^(t+1). Its gradient is thus given by

\nabla_{h^{(t)}} L = \left(\frac{\partial h^{(t+1)}}{\partial h^{(t)}}\right)^\top (\nabla_{h^{(t+1)}} L) + \left(\frac{\partial o^{(t)}}{\partial h^{(t)}}\right)^\top (\nabla_{o^{(t)}} L)    (10.20)
= W^\top \mathrm{diag}\left(1 - \left(h^{(t+1)}\right)^2\right) (\nabla_{h^{(t+1)}} L) + V^\top (\nabla_{o^{(t)}} L),    (10.21)

where \mathrm{diag}\left(1 - \left(h^{(t+1)}\right)^2\right) indicates the diagonal matrix containing the elements 1 - (h_i^{(t+1)})^2. This is the Jacobian of the hyperbolic tangent associated with the hidden unit i at time t + 1.
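The backward recursion of Eqs. 10.19-10.21 can be checked numerically. The sketch below assumes a small tanh RNN with a squared-error loss, so that the gradient on the outputs is simply o^(t) − y^(t) (standing in for the softmax case of Eq. 10.18); all dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small hypothetical RNN: h(t) = tanh(W h(t-1) + U x(t) + b), o(t) = V h(t) + c,
# with squared-error loss L = sum_t 0.5 ||o(t) - y(t)||^2.
T, n_in, n_hid, n_out = 4, 3, 5, 2
U = rng.normal(size=(n_hid, n_in))
W = 0.5 * rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
x = rng.normal(size=(T, n_in))
y = rng.normal(size=(T, n_out))

# Forward pass, storing every hidden state.
h = np.zeros((T, n_hid))
o = np.zeros((T, n_out))
prev = np.zeros(n_hid)
for t in range(T):
    h[t] = np.tanh(W @ prev + U @ x[t] + b)
    o[t] = V @ h[t] + c
    prev = h[t]

grad_o = o - y                          # gradient on outputs for squared error

# Backward recursion of Eqs. 10.19-10.21.
grad_h = np.zeros((T, n_hid))
grad_h[T - 1] = V.T @ grad_o[T - 1]     # Eq. 10.19: final step, only o(tau) follows
for t in range(T - 2, -1, -1):          # Eq. 10.21: both o(t) and h(t+1) follow h(t)
    grad_h[t] = W.T @ ((1 - h[t + 1] ** 2) * grad_h[t + 1]) + V.T @ grad_o[t]
```

A finite-difference check confirms that `grad_h[t]` is the full gradient of the total loss with respect to h^(t), as the ∇ notation in the text intends.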
Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes, which have descendents at all the time steps:

\nabla_c L = \sum_t \left(\frac{\partial o^{(t)}}{\partial c}\right)^\top \nabla_{o^{(t)}} L = \sum_t \nabla_{o^{(t)}} L
\nabla_b L = \sum_t \left(\frac{\partial h^{(t)}}{\partial b^{(t)}}\right)^\top \nabla_{h^{(t)}} L = \sum_t \mathrm{diag}\left(1 - \left(h^{(t)}\right)^2\right) \nabla_{h^{(t)}} L

\nabla_V L = \sum_t \left(\nabla_{o^{(t)}} L\right) h^{(t)\top}

\nabla_W L = \sum_t \mathrm{diag}\left(1 - \left(h^{(t)}\right)^2\right) \left(\nabla_{h^{(t)}} L\right) h^{(t-1)\top}

\nabla_U L = \sum_t \mathrm{diag}\left(1 - \left(h^{(t)}\right)^2\right) \left(\nabla_{h^{(t)}} L\right) x^{(t)\top}

We do not need to compute the gradient with respect to x^(t) for training because it does not have any parameters as ancestors in the computational graph defining the loss.

We are abusing notation somewhat in the above equations. We correctly use \nabla_{h^{(t)}} L to indicate the full influence of h^(t) through all paths from h^(t) to L. This is in contrast to our usage of \frac{\partial}{\partial W^{(t)}} or \frac{\partial}{\partial b^{(t)}}, which we use here in an unconventional manner. By \frac{\partial}{\partial W^{(t)}} we refer to the effect of W on h^(t) only via the use of W at time step t. This is not standard calculus notation, because the standard definition of the Jacobian would actually include the complete influence of W on h^(t) via its use in all of the preceding time steps to produce h^(t-1). What we refer to here is in fact the method of Sec. 6.5.6, that computes the contribution of a single edge in the computational graph to the gradient.
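The parameter gradients above can likewise be verified by accumulating each time step's contribution, as in this self-contained sketch (again a tanh RNN with squared-error loss and arbitrary dimensions, so the diag(1 − h²) factors match the equations):

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_in, n_hid, n_out = 4, 3, 5, 2
U = rng.normal(size=(n_hid, n_in))
W = 0.5 * rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_out)
x = rng.normal(size=(T, n_in))
y = rng.normal(size=(T, n_out))

def forward(W):
    """Forward pass returning hidden states, outputs, and total squared-error loss."""
    h = np.zeros((T, n_hid)); o = np.zeros((T, n_out)); prev = np.zeros(n_hid)
    for t in range(T):
        h[t] = np.tanh(W @ prev + U @ x[t] + b)
        o[t] = V @ h[t] + c
        prev = h[t]
    return h, o, 0.5 * np.sum((o - y) ** 2)

h, o, L = forward(W)
grad_o = o - y

# grad_h by the backward recursion of Eqs. 10.19-10.21.
grad_h = np.zeros((T, n_hid))
grad_h[T - 1] = V.T @ grad_o[T - 1]
for t in range(T - 2, -1, -1):
    grad_h[t] = W.T @ ((1 - h[t + 1] ** 2) * grad_h[t + 1]) + V.T @ grad_o[t]

# Parameter gradients: sum each parameter's per-time-step contribution.
h_prev = np.vstack([np.zeros(n_hid), h[:-1]])
dtanh = (1 - h ** 2) * grad_h         # diag(1 - h(t)^2) grad_h(t), for all t at once
grad_c = grad_o.sum(axis=0)
grad_b = dtanh.sum(axis=0)
grad_V = grad_o.T @ h                 # sum_t (grad_o(t)) h(t)^T
grad_W = dtanh.T @ h_prev             # sum_t diag(1 - h^2) (grad_h(t)) h(t-1)^T
grad_U = dtanh.T @ x                  # sum_t diag(1 - h^2) (grad_h(t)) x(t)^T
```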
In the example recurrent network we have developed so far, the losses L^(t) were cross-entropies between training targets y^(t) and outputs o^(t). As with a feedforward network, it is in principle possible to use almost any loss with a recurrent network. The loss should be chosen based on the task. As with a feedforward network, we usually wish to interpret the output of the RNN as a probability distribution, and we usually use the cross-entropy associated with that distribution to define the loss. Mean squared error is the cross-entropy loss associated with an output distribution that is a unit Gaussian, for example, just as with a feedforward network.

When we use a predictive log-likelihood training objective, such as Eq. 10.12, we train the RNN to estimate the conditional distribution of the next sequence element
This may mean that we maximize the log-likelihoo log-likelihood train the RNN to estimate the conditional distribution of the next sequence element (t) log pmay (y (t) mean ...,x , (10.22) | x(1) ,that y given the past inputs. This we )maximize the log-likelihoo d 387 x
log p(y |
, . . . , x ),
(10.22)
or, if the model includes connections from the output at one time step to the next time step,

\log p\left(y^{(t)} \mid x^{(1)}, \ldots, x^{(t)}, y^{(1)}, \ldots, y^{(t-1)}\right).    (10.23)

Decomposing the joint probability over the sequence of y values as a series of one-step probabilistic predictions is one way to capture the full joint distribution across the whole sequence. When we do not feed past y values as inputs that condition the next step prediction, the directed graphical model contains no edges from any y^(i) in the past to the current y^(t). In this case, the outputs y are conditionally independent given the sequence of x values. When we do feed the actual y values (not their prediction, but the actual observed or generated values) back into the network, the directed graphical model contains edges from all y^(i) values in the past to the current y^(t) value.
Figure 10.7: Fully connected graphical model for a sequence y^(1), y^(2), ..., y^(t), ...: every past observation y^(i) may influence the conditional distribution of some y^(t) (for t > i), given the previous values. Parametrizing the graphical model directly according to this graph (as in Eq. 10.6) might be very inefficient, with an ever growing number of inputs and parameters for each element of the sequence. RNNs obtain the same full connectivity but efficient parametrization, as illustrated in Fig. 10.8.

As a simple example, let us consider the case where the RNN models only a sequence of scalar random variables \mathbb{Y} = \{y^{(1)}, \ldots, y^{(\tau)}\}, with no additional inputs x. The input at time step t is simply the output at time step t − 1. The RNN then defines a directed graphical model over the y^(t) variables. We parametrize the joint distribution of these observations using the chain rule (Eq. 3.6) for conditional probabilities:

P(\mathbb{Y}) = P\left(y^{(1)}, \ldots, y^{(\tau)}\right) = \prod_{t=1}^{\tau} P\left(y^{(t)} \mid y^{(t-1)}, y^{(t-2)}, \ldots, y^{(1)}\right)    (10.24)
where the right-hand side of the bar is empty for t = 1, of course. Hence the negative log-likelihood of a set of values \{y^{(1)}, \ldots, y^{(\tau)}\} according to such a model is

L = \sum_t L^{(t)},    (10.25)

where

L^{(t)} = -\log P\left(\mathrm{y}^{(t)} = y^{(t)} \mid y^{(t-1)}, y^{(t-2)}, \ldots, y^{(1)}\right).    (10.26)
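A toy check of this decomposition: a small autoregressive RNN over a binary alphabet whose per-step terms follow Eqs. 10.25-10.26. Because Eq. 10.24 is a chain-rule factorization, the probabilities of all k^τ sequences must sum to one. The architecture and dimensions are illustrative assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

# Toy autoregressive RNN over a binary alphabet: the input at step t is the
# one-hot encoding of the output at step t-1, as in the scalar-sequence example.
k, tau, n_hid = 2, 3, 4
U = rng.normal(size=(n_hid, k)); W = 0.5 * rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(k, n_hid)); b = np.zeros(n_hid); c = np.zeros(k)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nll(seq):
    """Negative log-likelihood L = sum_t L(t), with
    L(t) = -log P(y(t) | y(t-1), ..., y(1))  (Eqs. 10.25-10.26)."""
    h = np.zeros(n_hid)
    y_prev = np.zeros(k)              # right-hand side of the bar empty at t = 1
    L = 0.0
    for y_t in seq:
        h = np.tanh(W @ h + U @ y_prev + b)
        p = softmax(V @ h + c)        # conditional distribution at this step
        L += -np.log(p[y_t])
        y_prev = np.eye(k)[y_t]
    return L

# Chain-rule factorization (Eq. 10.24): probabilities of all k^tau sequences
# must sum to exactly one.
total = sum(np.exp(-nll(seq)) for seq in product(range(k), repeat=tau))
```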
Figure 10.8: Introducing the state variable in the graphical model of the RNN, even though it is a deterministic function of its inputs, helps to see how we can obtain a very efficient parametrization, based on Eq. 10.5. Every stage in the sequence (for h^(t) and y^(t)) involves the same structure (the same number of inputs for each node) and can share the same parameters with the other stages.

The edges in a graphical model indicate which variables depend directly on other variables. Many graphical models aim to achieve statistical and computational efficiency by omitting edges that do not correspond to strong interactions. For example, it is common to make the Markov assumption that the graphical model should only contain edges from \{y^{(t-k)}, \ldots, y^{(t-1)}\} to y^(t), rather than containing edges from the entire past history. However, in some cases, we believe that all past inputs should have an influence on the next element of the sequence. RNNs are useful when we believe that the distribution over y^(t) may depend on a value of y^(i) from the distant past in a way that is not captured by the effect of y^(i) on y^(t-1).

One way to interpret an RNN as a graphical model is to view the RNN as defining a graphical model whose structure is the complete graph, able to represent direct dependencies between any pair of y values. The graphical model over the y values with the complete graph structure is shown in Fig. 10.7. The complete graph interpretation of the RNN is based on ignoring the hidden units h^(t) by marginalizing them out of the model.

It is more interesting to consider the graphical model structure of RNNs that results from regarding the hidden units h^(t) as random variables. (The conditional distribution over these variables given their parents is deterministic; this is perfectly legitimate, though it is somewhat rare to design a graphical model with deterministic hidden units.)
Including the hidden units in the graphical model reveals that the RNN provides a very efficient parametrization of the joint distribution over the observations. Suppose that we represented an arbitrary joint distribution over discrete values with a tabular representation: an array containing a separate entry for each possible assignment of values, with the value of that entry giving the probability of that assignment occurring. If y can take on k different values, the tabular representation would have O(k^τ) parameters. By comparison, due to parameter sharing, the number of parameters in the RNN is O(1) as a function of sequence length.
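A back-of-envelope count makes the contrast concrete (the particular k, τ, and hidden size are illustrative, not values from the text):

```python
# Tabular representation vs. RNN parameter sharing.
k, tau = 10, 20          # 10 possible symbol values, sequences of length 20
n_hid = 128              # hidden units in a hypothetical self-looping RNN

tabular_entries = k ** tau                 # one entry per possible sequence
# An RNN as in Eq. 10.5 reuses U, W, V and the biases at every step, so its
# size is independent of tau:
rnn_params = (n_hid * k        # U: input (one-hot symbol) to hidden
              + n_hid * n_hid  # W: hidden to hidden
              + k * n_hid      # V: hidden to output
              + n_hid + k)     # b, c biases
print(tabular_entries, rnn_params)   # 10^20 entries vs. ~2 * 10^4 parameters
```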
The number of parameters in the RNN may be adjusted to control model capacity but is not forced to scale with sequence length. Eq. 10.5 shows that the RNN parametrizes long-term relationships between variables efficiently, using recurrent applications of the same function f and the same parameters θ at each time step. Fig. 10.8 illustrates the graphical model interpretation. Incorporating the h^(t) nodes in the graphical model decouples the past and the future, acting as an intermediate quantity between them. A variable y^(i) in the distant past may influence a variable y^(t) via its effect on h. The structure of this graph shows that the model can be efficiently parametrized by using the same conditional probability distributions at each time step, and that when the variables are all observed, the probability of the joint assignment of all variables can be evaluated efficiently.

Even with the efficient parametrization of the graphical model, some operations remain computationally challenging. For example, it is difficult to predict missing values in the middle of the sequence.

The price recurrent networks pay for their reduced number of parameters is that optimizing the parameters may be difficult.

The parameter sharing used in recurrent networks relies on the assumption that the same parameters can be used for different time steps. Equivalently, the assumption is that the conditional probability distribution over the variables at time t + 1 given the variables at time t is stationary, meaning that the relationship between the previous time step and the next time step does not depend on t. In principle, it would be possible to use t as an extra input at each time step and let the learner discover any time-dependence while sharing as much as it can between different time steps. This would already be much better than using a different conditional probability distribution for each t, but the network would then have to extrapolate when faced with new values of t.
The main op operation eration that we need to perform is To complete our view of an RNN as a graphical model, we must describe how perfectly thoughthe it ismo somewhat raremain to design a graphical model such to draw legitimate, samples from del. The operation that wewith need todeterministic perform is hidden units.
simply to sample from the conditional distribution at each time step. However, there is one additional complication. The RNN must have some mechanism for determining the length of the sequence. This can be achieved in various ways.

In the case when the output is a symbol taken from a vocabulary, one can add a special symbol corresponding to the end of a sequence (Schmidhuber, 2012). When that symbol is generated, the sampling process stops. In the training set, we insert this symbol as an extra member of the sequence, immediately after x^(τ) in each training example.
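A minimal sampling loop using such an end-of-sequence symbol might look as follows; the vocabulary size, architecture, and safety cap are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sampling with a special end-of-sequence symbol. Vocabulary: symbols 0..k-1,
# with index k reserved for <EOS>.
k, n_hid = 4, 8
EOS = k
U = rng.normal(size=(n_hid, k + 1)); W = 0.5 * rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(k + 1, n_hid)); b = np.zeros(n_hid); c = np.zeros(k + 1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(max_len=50):
    """Draw a symbol from the conditional at each step; stop when <EOS> appears."""
    h = np.zeros(n_hid)
    prev = np.zeros(k + 1)
    out = []
    for _ in range(max_len):          # safety cap for the sketch
        h = np.tanh(W @ h + U @ prev + b)
        y = rng.choice(k + 1, p=softmax(V @ h + c))
        if y == EOS:                  # generated the end-of-sequence symbol: stop
            break
        out.append(int(y))
        prev = np.eye(k + 1)[y]
    return out

seq = sample()
```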
For example, it may be applied to an RNN that emits a sequence of real numbers. The new output unit is usually a sigmoid unit trained with the cross-entropy loss. In this approach the sigmoid is trained to maximize the log-probability of the correct prediction as to whether the sequence ends or continues at each time step.

Another way to determine the sequence length τ is to add an extra output to the model that predicts the integer τ itself. The model can sample a value of τ and then sample τ steps worth of data. This approach requires adding an extra input to the recurrent update at each time step so that the recurrent update is aware of whether it is near the end of the generated sequence. This extra input can either consist of the value of τ or can consist of τ − t, the number of remaining time steps. Without this extra input, the RNN might generate sequences that end abruptly, such as a sentence that ends before it is complete.
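As a concrete sketch of this last strategy, the following NumPy fragment (not from the book; all weights, sizes and distributions are illustrative placeholders) first samples τ and then conditions each recurrent update on the countdown τ − t:

```python
import numpy as np

# Minimal sketch: sample the sequence length tau first, then generate tau
# steps of real-valued data, feeding the number of remaining steps (tau - t)
# into the recurrent update as an extra input. Weights are random placeholders.
rng = np.random.default_rng(0)
x_size, hidden_size, max_len = 3, 6, 10

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
U = rng.normal(scale=0.1, size=(hidden_size, x_size))       # input-to-hidden
u_count = rng.normal(scale=0.1, size=hidden_size)           # weight on tau - t
V = rng.normal(scale=0.1, size=(x_size, hidden_size))       # hidden-to-output

def sample_sequence():
    # Step 1: sample tau itself. Here a placeholder uniform distribution is
    # used; in the text tau would come from an extra output of the model.
    tau = int(rng.integers(1, max_len + 1))
    h = np.zeros(hidden_size)
    x = np.zeros(x_size)
    xs = []
    # Step 2: sample tau steps worth of data.
    for t in range(1, tau + 1):
        remaining = tau - t                              # countdown input
        h = np.tanh(W @ h + U @ x + u_count * remaining)
        x = V @ h + rng.normal(scale=0.1, size=x_size)   # real-valued sample
        xs.append(x)
    return tau, np.stack(xs)

tau, xs = sample_sequence()
```

Because the update sees `remaining`, the hidden state can, in principle, learn to wind generation down rather than stopping abruptly.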
This approach is based on the decomposition

P(x^(1), ..., x^(τ)) = P(τ) P(x^(1), ..., x^(τ) | τ).   (10.27)

The strategy of predicting τ directly is used for example by Goodfellow et al. (2014d).

In the previous section we described how an RNN could correspond to a directed graphical model over a sequence of random variables y^(t) with no inputs x. Of course, our development of RNNs as in Eq. 10.8 included a sequence of inputs x^(1), x^(2), ..., x^(τ). In general, RNNs allow the extension of the graphical model view to represent not only a joint distribution over the y variables but also a
conditional distribution over y given x. As discussed in the context of feedforward networks in Sec. 6.2.1.1, any model representing a variable P(y; θ) can be reinterpreted as a model representing a conditional distribution P(y | ω) with ω = θ. We can extend such a model to represent a distribution P(y | x) by using the same P(y | ω) as before, but making ω a function of x. In the case of an RNN, this can be achieved in different ways. We review here the most common and obvious choices.

Previously, we have discussed RNNs that take a sequence of vectors x^(t) for t = 1, ..., τ as input. Another option is to take only a single vector x as input. When x is a fixed-size vector, we can simply make it an extra input of the RNN that generates the y sequence. Some common ways of providing an extra input to an RNN are:

1. as an extra input at each time step, or
2. as the initial state h^(0), or
3. both.
The first and most common approach is illustrated in Fig. 10.9. The interaction between the input x and each hidden unit vector h^(t) is parametrized by a newly introduced weight matrix R that was absent from the model of only the sequence of y values. The same product x^T R is added as additional input to the hidden units at every time step. We can think of the choice of x as determining the value of x^T R that is effectively a new bias parameter used for each of the hidden units. The weights remain independent of the input. We can think of this model as taking the parameters θ of the non-conditional model and turning them into ω, where the bias parameters within ω are now a function of the input.
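A rough sketch of this conditioning scheme (not from the book; shapes and weights are illustrative placeholders) shows how the product x^T R acts as a fixed extra bias reused at every time step:

```python
import numpy as np

# Conditioning an RNN on a single fixed-size vector x, as in Fig. 10.9:
# R.T applied to x yields a vector that is added to the hidden pre-activation
# at every time step, acting as an input-dependent bias. Weights are random
# placeholders; sizes are illustrative.
rng = np.random.default_rng(1)
x_size, hidden_size, T = 4, 6, 3

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden-to-hidden
R = rng.normal(scale=0.1, size=(x_size, hidden_size))       # new matrix R
b = np.zeros(hidden_size)                                   # ordinary bias

def run(x):
    extra_bias = x @ R            # x^T R: computed once, reused at every step
    h = np.zeros(hidden_size)
    states = []
    for _ in range(T):
        h = np.tanh(W @ h + extra_bias + b)
        states.append(h.copy())
    return np.stack(states)

states = run(rng.normal(size=x_size))
```

Note that `extra_bias` is computed once: changing x changes the effective bias of every hidden unit, which is exactly the "parameters become a function of the input" view described above.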
Rather than receiving only a single vector x as input, the RNN may receive a sequence of vectors x^(t) as input. The RNN described in Eq. 10.8 corresponds to a conditional distribution P(y^(1), ..., y^(τ) | x^(1), ..., x^(τ)) that makes a conditional independence assumption that this distribution factorizes as

∏_t P(y^(t) | x^(1), ..., x^(t)).   (10.28)

To remove the conditional independence assumption, we can add connections from the output at time t to the hidden unit at time t + 1, as shown in Fig. 10.10. The model can then represent arbitrary probability distributions over the y sequence. This kind of model representing a distribution over a sequence given another sequence still has one restriction, which is that the length of both sequences must be the same. We describe how to remove this restriction in Sec. 10.4.
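A minimal sketch of these output-to-hidden feedback connections (not from the book; all weights and sizes are illustrative placeholders) feeds the previously sampled output back into the next hidden update:

```python
import numpy as np

# Sketch of the architecture of Fig. 10.10: the output sampled at time t - 1
# feeds back into the hidden state at time t, so the y values are no longer
# conditionally independent given x. Weights are random placeholders.
rng = np.random.default_rng(2)
x_size, hidden_size, y_size, T = 3, 5, 4, 6

W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # h -> h
U = rng.normal(scale=0.1, size=(hidden_size, x_size))        # x -> h
Wy = rng.normal(scale=0.1, size=(hidden_size, y_size))       # previous y -> h
V = rng.normal(scale=0.1, size=(y_size, hidden_size))        # h -> y logits

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sample(xs):
    h = np.zeros(hidden_size)
    y_prev = np.zeros(y_size)
    ys = []
    for x in xs:
        h = np.tanh(W @ h + U @ x + Wy @ y_prev)  # feedback from last output
        p = softmax(V @ h)
        s = int(rng.choice(y_size, p=p))
        y_prev = np.eye(y_size)[s]                # one-hot of sampled output
        ys.append(s)
    return ys

ys = sample(rng.normal(size=(T, x_size)))
```

The `Wy @ y_prev` term is the only addition relative to the plain conditional RNN, yet it is what allows the model to represent arbitrary distributions over the y sequence.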
Figure 10.9: An RNN that maps a fixed-length vector x into a distribution over sequences Y. This RNN is appropriate for tasks such as image captioning, where a single image is used as input to a model that then produces a sequence of words describing the image. Each element y^(t) of the observed output sequence serves both as input (for the current time step) and, during training, as target (for the previous time step).
Figure 10.10: A conditional recurrent neural network mapping a variable-length sequence of x values into a distribution over sequences of y values of the same length. Compared to Fig. 10.3, this RNN contains connections from the previous output to the current state. These connections allow this RNN to model an arbitrary distribution over sequences of y given sequences of x of the same length. The RNN of Fig. 10.3 is only able to represent distributions in which the y values are conditionally independent from each other given the x values.
Figure 10.11: Computation of a typical bidirectional recurrent neural network, meant to learn to map input sequences x to target sequences y, with loss L^(t) at each step t. The h recurrence propagates information forward in time (towards the right) while the g recurrence propagates information backward in time (towards the left). Thus at each point t, the output units o^(t) can benefit from a relevant summary of the past in its h^(t) input and from a relevant summary of the future in its g^(t) input.
10.3
Bidirectional RNNs
All of the recurrent networks we have considered up to now have a "causal" structure, meaning that the state at time t only captures information from the past, x^(1), ..., x^(t−1), and the present input x^(t). Some of the models we have discussed also allow information from past y values to affect the current state when the y values are available.

However, in many applications we want to output a prediction of y^(t) which may depend on the whole input sequence. For example, in speech recognition, the correct interpretation of the current sound as a phoneme may depend on the next few phonemes because of co-articulation and potentially may even depend on the next few words because of the linguistic dependencies between nearby words: if there are two interpretations of the current word that are both acoustically plausible, we may have to look far into the future (and the past) to disambiguate them.
This is also true of handwriting recognition and many other sequence-to-sequence learning tasks, described in the next section.

Bidirectional recurrent neural networks (or bidirectional RNNs) were invented to address that need (Schuster and Paliwal, 1997). They have been extremely successful (Graves, 2012) in applications where that need arises, such as handwriting recognition (Graves et al., 2008; Graves and Schmidhuber, 2009), speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013) and bioinformatics (Baldi et al., 1999).

As the name suggests, bidirectional RNNs combine an RNN that moves forward through time beginning from the start of the sequence with another RNN that moves backward through time beginning from the end of the sequence. Fig. 10.11 illustrates the typical bidirectional RNN, with h^(t) standing for the state of the sub-RNN that moves forward through time and g^(t) standing for the state of the sub-RNN that moves backward through time. This allows the output units o^(t) to compute a representation that depends on both the past and the future but is most sensitive to the input values around time t, without having to specify a fixed-size window around t (as one would have to do with a feedforward network, a convolutional network, or a regular RNN with a fixed-size look-ahead buffer).

This idea can be naturally extended to 2-dimensional input, such as images, by having four RNNs, each one going in one of the four directions: up, down, left, right. At each point (i, j) of a 2-D grid, an output O_{i,j} could then compute a representation that would capture mostly local information but could also depend on long-range inputs, if the RNN is able to learn to carry that information.

Compared to a convolutional network, RNNs applied to images are typically more
expensive but allow for long-range lateral interactions between features in the same feature map (Visin et al., 2015; Kalchbrenner et al., 2015). Indeed, the forward propagation equations for such RNNs may be written in a form that shows they use a convolution that computes the bottom-up input to each layer, prior to the recurrent propagation across the feature map that incorporates the lateral interactions.
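The bidirectional construction described above can be sketched in a few lines of NumPy (not from the book; weights and sizes are illustrative placeholders): one cell is run forward over the sequence, another backward, and the per-step states are concatenated so the output at step t sees both directions:

```python
import numpy as np

# Minimal bidirectional-RNN sketch (cf. Fig. 10.11): a forward pass produces
# h(t), a backward pass over the reversed input produces g(t), and the output
# representation at step t is [h(t); g(t)]. Weights are random placeholders.
rng = np.random.default_rng(3)
x_size, hidden_size, T = 4, 5, 7

def make_cell():
    W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    U = rng.normal(scale=0.1, size=(hidden_size, x_size))
    return W, U

(Wf, Uf), (Wb, Ub) = make_cell(), make_cell()

def run(xs):
    h = np.zeros(hidden_size)
    hs = []
    for x in xs:                       # forward recurrence: past -> future
        h = np.tanh(Wf @ h + Uf @ x)
        hs.append(h)
    g = np.zeros(hidden_size)
    gs = []
    for x in reversed(xs):             # backward recurrence: future -> past
        g = np.tanh(Wb @ g + Ub @ x)
        gs.append(g)
    gs.reverse()                       # realign g(t) with time step t
    return np.stack([np.concatenate([h_t, g_t]) for h_t, g_t in zip(hs, gs)])

out = run(list(rng.normal(size=(T, x_size))))
```

Each row of `out` has twice the hidden size, reflecting that the representation at step t summarizes both the past (through h) and the future (through g).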
10.4
Encoder-Decoder Sequence-to-Sequence Architectures

We have seen in Fig. 10.5 how an RNN can map an input sequence to a fixed-size vector. We have seen in Fig. 10.9 how an RNN can map a fixed-size vector to a sequence. We have seen in Fig. 10.3, Fig. 10.4, Fig. 10.10 and Fig. 10.11 how an RNN can map an input sequence to an output sequence of the same length.

Here we discuss how an RNN can be trained to map an input sequence to an output sequence which is not necessarily of the same length. This comes up in many applications, such as speech recognition, machine translation or question answering, where the input and output sequences in the training set are generally not of the same length (although their lengths might be related).

We often call the input to the RNN the "context." We want to produce a representation of this context, C.
The context C might be a vector or sequence of vectors that summarize the input sequence X = (x^(1), ..., x^(n_x)).

The simplest RNN architecture for mapping a variable-length sequence to another variable-length sequence was first proposed by Cho et al. (2014a) and shortly after by Sutskever et al. (2014), who independently developed that architecture and were the first to obtain state-of-the-art translation using this approach. The former system is based on scoring proposals generated by another machine translation system, while the latter uses a standalone recurrent network to generate the translations. These authors respectively called this architecture, illustrated in Fig. 10.12, the encoder-decoder or sequence-to-sequence architecture. The idea is very simple: (1) an encoder or reader or input RNN processes the input sequence. The encoder emits the context C, usually as a simple function of its final hidden state. (2) a decoder or writer or output RNN is conditioned on that fixed-length vector (just like in Fig. 10.9) to generate the output sequence Y = (y^(1), ..., y^(n_y)). The innovation of this kind of architecture over those presented in earlier sections of this chapter is that the lengths n_x and n_y can vary from each other, while previous architectures constrained n_x = n_y = τ. In a sequence-to-sequence architecture, the two RNNs
Figure 10.12: Example of an encoder-decoder or sequence-to-sequence RNN architecture, for learning to generate an output sequence (y^(1), ..., y^(n_y)) given an input sequence (x^(1), ..., x^(n_x)). It is composed of an encoder RNN that reads the input sequence and a decoder RNN that generates the output sequence (or computes the probability of a given output sequence). The final hidden state of the encoder RNN is used to compute a generally fixed-size context variable C which represents a semantic summary of the input sequence and is given as input to the decoder RNN.
are trained jointly to maximize the average of log P(y^(1), ..., y^(n_y) | x^(1), ..., x^(n_x)) over all the pairs of x and y sequences in the training set. The last state h_{n_x} of the encoder RNN is typically used as a representation C of the input sequence that is provided as input to the decoder RNN.

If the context C is a vector, then the decoder RNN is simply a vector-to-sequence RNN as described in Sec. 10.2.4. As we have seen, there are at least two ways for a vector-to-sequence RNN to receive input. The input can be provided as the initial state of the RNN, or the input can be connected to the hidden units at each time step. These two ways can also be combined.

There is no constraint that the encoder must have the same size of hidden layer as the decoder.

One clear limitation of this architecture is when the context C output by the encoder RNN has a dimension that is too small to properly summarize a long sequence. This phenomenon was observed by Bahdanau et al. (2015) in the context of machine translation. They proposed to make C a variable-length sequence rather than a fixed-size vector. Additionally, they introduced an attention mechanism that learns to associate elements of the sequence C to elements of the output sequence. See Sec. 12.4.5.1 for more details.
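The encoder-decoder scheme described above can be sketched as follows (not from the book; all weights, sizes and the exact way C enters the decoder are illustrative assumptions):

```python
import numpy as np

# Encoder-decoder sketch (cf. Fig. 10.12): the encoder reads a length-n_x
# input sequence and summarizes it as a context C (its final hidden state);
# the decoder, conditioned on C at every step, emits n_y outputs, where n_y
# is free to differ from n_x. Weights are random placeholders.
rng = np.random.default_rng(4)
x_size, hidden_size, y_size = 3, 6, 4

We = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # encoder h -> h
Ue = rng.normal(scale=0.1, size=(hidden_size, x_size))       # encoder x -> h
Wd = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # decoder h -> h
Uc = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # context -> h
V = rng.normal(scale=0.1, size=(y_size, hidden_size))        # decoder h -> y

def encode(xs):
    h = np.zeros(hidden_size)
    for x in xs:
        h = np.tanh(We @ h + Ue @ x)
    return h                           # context C = final encoder state

def decode(C, n_y):
    h = np.tanh(Uc @ C)                # C initializes the decoder state
    ys = []
    for _ in range(n_y):
        h = np.tanh(Wd @ h + Uc @ C)   # C is also fed at every step
        ys.append(V @ h)
    return np.stack(ys)

C = encode(list(rng.normal(size=(5, x_size))))   # n_x = 5
ys = decode(C, n_y=8)                            # n_y = 8: lengths differ
```

Feeding C both as the initial state and at every step combines the two conditioning options discussed in the text; either one alone would also be a valid choice.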
10.5
Deep Recurrent Networks
The computation in most RNNs can be decomposed into three blocks of parameters and associated transformations:

1. from the input to the hidden state,
2. from the previous hidden state to the next hidden state, and
3. from the hidden state to the output.

With the RNN architecture of Fig. 10.3, each of these three blocks is associated with a single weight matrix. In other words, when the network is unfolded, each of these corresponds to a shallow transformation. By a shallow transformation, we mean a transformation that would be represented by a single layer within a deep MLP. Typically this is a transformation represented by a learned affine transformation followed by a fixed nonlinearity.

Would it be advantageous to introduce depth in each of these operations? Experimental evidence (Graves et al., 2013; Pascanu et al., 2014a) strongly suggests so. The experimental evidence is in agreement with the idea that we need enough
Figure 10.13: A recurrent neural network can be made deep in many ways (Pascanu et al., 2014a). (a) The hidden recurrent state can be broken down into groups organized hierarchically. (b) Deeper computation (e.g., an MLP) can be introduced in the input-to-hidden, hidden-to-hidden and hidden-to-output parts. This may lengthen the shortest path linking different time steps. (c) The path-lengthening effect can be mitigated by introducing skip connections.
depth in order to perform the required mappings. See also Schmidh Schmidhub ub uber er (1992), El Hihi and Bengio (1996), or Jaeger (2007a) for earlier work on deep RNNs. depth in order to perform the required mappings. See also Schmidhuber (1992), Gra Graves ves et al. (2013) were the first to show a significant benefit of decomp decomposing osing El Hihi and Bengio (1996), or Jaeger (2007a) for earlier work on deep RNNs. the state of an RNN in into to multiple la lay yers as in Fig. 10.13 (left). We can think Gra ves et al. ( 2013 ) w ere the first to show ainsignificant ofying decomp osing of the low lower er la lay yers in the hierarc hierarch hy depicted Fig. 10.13baenefit as pla playing a role in the state of an to multiple layers as in Fig. 10.13 (left).appropriate, We can think transforming theRNN ra raw w in input in into to a representation that is more at of the lowerlev laels yersofinthe thehidden hierarcstate. hy depicted in Fig. a as plaaying role in the higher levels Pascan ascanu u et al.10.13 (2014a ) go stepa further transforming raewa separate input into a representation that more appropriate, at and prop propose ose tothe hav have MLP (p (possibly ossibly deep) for is eac each h of the three blo blocks cks the higher lev els of the hidden state. P ascan u et al. ( 2014a ) go a step further en enumerated umerated ab abov ov ove, e, as illustrated in Fig. 10.13b. Considerations of representational and prop ose to hav a cate separate MLP (possibly deep) eacthree h of the three cks capacit capacity y suggest to eallo allocate enough capacity in each of for these steps, butblo doing enumerated ove, as illustrated in Fig.by10.13 b. Considerations of representational so by addingab depth may hurt learning making optimization difficult. 
In general, it is easier to optimize shallower architectures, and adding the extra depth of Fig. 10.13b makes the shortest path from a variable in time step t to a variable in time step t + 1 become longer. For example, if an MLP with a single hidden layer is used for the state-to-state transition, we have doubled the length of the shortest path between variables in any two different time steps, compared with the ordinary RNN of Fig. 10.3. However, as argued by Pascanu et al. (2014a), this can be mitigated by introducing skip connections in the hidden-to-hidden path, as illustrated in Fig. 10.13c.
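As a concrete illustration, here is a minimal NumPy sketch of one time step of such an architecture. All names, sizes, and the single-hidden-layer choice are illustrative assumptions, not the parameterization of Pascanu et al.: the state-to-state transition is an MLP, and a skip connection keeps a length-one path between consecutive hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x = 16, 8

# hypothetical parameters for a deep-transition RNN step (cf. Fig. 10.13b/c)
W_in   = rng.normal(0.0, 0.1, (n_h, n_x))  # input-to-hidden
W_mid  = rng.normal(0.0, 0.1, (n_h, n_h))  # hidden layer of the transition MLP
W_out  = rng.normal(0.0, 0.1, (n_h, n_h))  # MLP output back to the state
W_skip = rng.normal(0.0, 0.1, (n_h, n_h))  # skip connection (Fig. 10.13c)

def step(h, x):
    # deep (MLP) state-to-state transition: lengthens the shortest path
    # between time steps from 1 to 2
    z = np.tanh(W_mid @ h + W_in @ x)
    # the skip connection restores a length-1 path between time steps
    return np.tanh(W_out @ z + W_skip @ h)

h = np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h = step(h, x)
```

Without the `W_skip @ h` term, every path from h at one time step to the next would pass through the intermediate layer z, which is the path-lengthening effect discussed above.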
10.6 Recursive Neural Networks
Recursive neural networks represent yet another generalization of recurrent networks, with a different kind of computational graph, which is structured as a deep tree, rather than the chain-like structure of RNNs. The typical computational graph for a recursive network is illustrated in Fig. 10.14. Recursive neural networks were introduced by Pollack (1990) and their potential use for learning to reason was described by Bottou (2011). Recursive networks have been successfully applied to processing data structures as input to neural nets (Frasconi et al., 1997, 1998), in natural language processing (Socher et al., 2011a,c, 2013a) as well as in computer vision (Socher et al., 2011b).
One clear advantage of recursive nets over recurrent nets is that for a sequence of the same length τ, the depth (measured as the number of compositions of nonlinear operations) can be drastically reduced from τ to O(log τ), which might help deal with long-term dependencies. An open question is how to best structure the tree. One option is to have a tree structure which does not depend on the data, such as a balanced binary tree. (We suggest not abbreviating "recursive neural network" as "RNN," to avoid confusion with "recurrent neural network.")
Figure 10.14: A recursive network has a computational graph that generalizes that of the recurrent network from a chain to a tree. A variable-size sequence x(1), x(2), . . . , x(t) can be mapped to a fixed-size representation (the output o), with a fixed set of parameters (the weight matrices U, V, W). The figure illustrates a supervised learning case in which some target y is provided which is associated with the whole sequence.
In some application domains, external methods can suggest the appropriate tree structure. For example, when processing natural language sentences, the tree structure for the recursive network can be fixed to the structure of the parse tree of the sentence provided by a natural language parser (Socher et al., 2011a, 2013a). Ideally, one would like the learner itself to discover and infer the tree structure that is appropriate for any given input, as suggested by Bottou (2011).

Many variants of the recursive net idea are possible. For example, Frasconi et al. (1997) and Frasconi et al. (1998) associate the data with a tree structure, and associate the inputs and targets with individual nodes of the tree. The computation performed by each node does not have to be the traditional artificial neuron computation (affine transformation of all inputs followed by a monotone nonlinearity). For example, Socher et al.
(2013a) propose using tensor operations and bilinear forms, which have previously been found useful to model relationships between concepts (Weston et al., 2010; Bordes et al., 2012) when the concepts are represented by continuous vectors (embeddings).
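The depth reduction from τ to O(log τ) is easy to see in code. The following sketch is illustrative only; the weight shapes, the single shared composition function, and the restriction to power-of-two sequence lengths are all assumptions. It reduces a sequence over a balanced binary tree:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
U = rng.normal(0.0, 0.5, (d, 2 * d))  # shared weights applied at every tree node

def compose(left, right):
    # one tree node: affine transformation of both children, then a nonlinearity
    return np.tanh(U @ np.concatenate([left, right]))

def recursive_reduce(xs):
    """Reduce a length-2^k sequence over a balanced binary tree."""
    depth = 0
    while len(xs) > 1:
        xs = [compose(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
        depth += 1
    return xs[0], depth

seq = [rng.normal(size=d) for _ in range(8)]  # tau = 8
root, depth = recursive_reduce(seq)
print(depth)  # 3 compositions deep, versus 8 for a chain-structured RNN
```

Here the root representation is only log2(8) = 3 nonlinear compositions away from any leaf, whereas a chain-structured RNN would place the first input τ = 8 compositions away from the final state.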
10.7 The Challenge of Long-Term Dependencies
The mathematical challenge of learning long-term dependencies in recurrent networks was introduced in Sec. 8.2.5. The basic problem is that gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization). Even if we assume that the parameters are such that the recurrent network is stable (can store memories, with gradients not exploding), the difficulty with long-term dependencies arises from the exponentially smaller weights given to long-term interactions (involving the multiplication of many Jacobians) compared to short-term ones. Many other sources provide a deeper treatment (Hochreiter, 1991; Doya, 1993; Bengio et al., 1994; Pascanu et al., 2013a). In this section, we describe the problem in more detail. The remaining sections describe approaches to overcoming the problem.
Recurrent networks involve the composition of the same function multiple times, once per time step. These compositions can result in extremely nonlinear behavior, as illustrated in Fig. 10.15.

In particular, the function composition employed by recurrent neural networks somewhat resembles matrix multiplication. We can think of the recurrence relation

    h(t) = W⊤ h(t−1)                                    (10.29)

as a very simple recurrent neural network lacking a nonlinear activation function,
Figure 10.15: When composing many nonlinear functions (like the linear-tanh layer shown here), the result is highly nonlinear, typically with most of the values associated with a tiny derivative, some values with a large derivative, and many alternations between increasing and decreasing. In this plot, we plot a linear projection of a 100-dimensional hidden state down to a single dimension, plotted on the y-axis. The x-axis is the coordinate of the initial state along a random direction in the 100-dimensional space. We can thus view this plot as a linear cross-section of a high-dimensional function. The plots show the function after each time step, or equivalently, after each number of times the transition function has been composed.
and lacking inputs x. As described in Sec. 8.2.5, this recurrence relation essentially describes the power method. It may be simplified to

    h(t) = (W^t)⊤ h(0),                                 (10.30)

and if W admits an eigendecomposition

    W = Q Λ Q⊤,                                         (10.31)

with orthogonal Q, the recurrence may be simplified further to

    h(t) = Q Λ^t Q⊤ h(0).                               (10.32)
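The closed form above can be checked numerically. The sketch below is a toy 3-dimensional example with made-up eigenvalues: it builds W from an orthogonal Q and a diagonal Λ, iterates the recurrence of Eq. 10.29, and compares against the closed form; the component of h(0) along the eigenvalue-0.5 eigenvector is quickly crushed, while the component along the eigenvalue-1.1 eigenvector grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# W = Q Lambda Q^T with an orthogonal Q and chosen (made-up) eigenvalues
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
lam = np.array([1.1, 0.9, 0.5])
W = Q @ np.diag(lam) @ Q.T

h0 = rng.normal(size=3)
t = 50

# iterate the recurrence h(t) = W^T h(t-1)
h = h0.copy()
for _ in range(t):
    h = W.T @ h

# closed form h(t) = Q Lambda^t Q^T h(0)
h_closed = Q @ np.diag(lam ** t) @ Q.T @ h0
assert np.allclose(h, h_closed)

# coordinates of h(t) in the eigenbasis: components with |lambda| < 1
# have decayed toward zero after 50 steps
coords = Q.T @ h
```

Note that with Λ diagonal, W here is symmetric, which is what makes the transpose in Eq. 10.30 harmless in this example.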
The eigenvalues are raised to the power of t, causing eigenvalues with magnitude less than one to decay to zero and eigenvalues with magnitude greater than one to explode. Any component of h(0) that is not aligned with the largest eigenvector will eventually be discarded.

This problem is particular to recurrent networks. In the scalar case, imagine multiplying a weight w by itself many times. The product w^t will either vanish or explode depending on the magnitude of w. However, if we make a non-recurrent network that has a different weight w(t) at each time step, the situation is different. If the initial state is given by 1, then the state at time t is given by the product of the w(t). Suppose that the w(t) values are generated randomly, independently from one another, with zero mean and variance v.
The variance of the product is O(v^n). To obtain some desired variance v∗ we may choose the individual weights with variance v = (v∗)^(1/n). Very deep feedforward networks with carefully chosen scaling can thus avoid the vanishing and exploding gradient problem, as argued by Sussillo (2014).

The vanishing and exploding gradient problem for RNNs was independently discovered by separate researchers (Hochreiter, 1991; Bengio et al., 1993, 1994). One may hope that the problem can be avoided simply by staying in a region of parameter space where the gradients do not vanish or explode. Unfortunately, in order to store memories in a way that is robust to small perturbations, the RNN must enter a region of parameter space where gradients vanish (Bengio et al., 1993, 1994). Specifically, whenever the model is able to represent long term dependencies, the gradient of a long term interaction has exponentially smaller magnitude than the gradient of a short term interaction.
It does not mean that it dep is impossible thelearn, gradien t of a long termtake interaction has exp onen magnitude than to but that it might a very long time to tially learn smaller long-term dep dependencies, endencies, the gradient of a short term in teraction. It does not mean that it is impossible because the signal ab about out these dep dependencies endencies will tend to be hidden by the smallest to learn, but arising that it from mightshort-term take a verydep long time to learn long-term endencies, fluctuations dependencies. endencies. In practice, thedep exp experiments eriments b ecause the signal ab out these dep endencies will tend to b e hidden by the smallest in Bengio et al. (1994) show that as we increase the span of the dependencies that fluctuations arising from short-term dependencies. In practice, the experiments in Bengio et al. (1994) show that as we 405 increase the span of the dependencies that
need to be captured, gradient-based optimization becomes increasingly difficult, with the probability of successful training of a traditional RNN via SGD rapidly reaching 0 for sequences of only length 10 or 20.

For a deeper treatment of the dynamical systems view of recurrent networks, see Doya (1993), Bengio et al. (1994) and Siegelmann and Sontag (1995), with a review in Pascanu et al. (2013a). The remaining sections of this chapter discuss various approaches that have been proposed to reduce the difficulty of learning long-term dependencies (in some cases allowing an RNN to learn dependencies across hundreds of steps), but the problem of learning long-term dependencies remains one of the main challenges in deep learning.
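The vanishing effect is easy to reproduce in a scalar toy model; everything below, including the weight value 0.9, is an illustrative assumption. The derivative of the final state with respect to the initial state is a product of per-step factors w · (1 − h_t²), each at most |w| in magnitude, so it shrinks at least as fast as |w|^span:

```python
import numpy as np

def grad_wrt_initial_state(span, w=0.9, seed=0):
    """|d h_T / d h_0| for a scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    rng = np.random.default_rng(seed)
    h, g = 0.0, 1.0
    for x in rng.normal(size=span):
        h = np.tanh(w * h + x)
        g *= w * (1.0 - h ** 2)  # chain rule through one time step
    return abs(g)

for span in (5, 10, 20, 50):
    print(span, grad_wrt_initial_state(span))
# the gradient magnitudes shrink (at least) exponentially with the span
```

With |w| > 1 and inputs small enough to keep tanh near its linear regime, the same product can instead explode, which is the other half of the difficulty described above.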
10.8 Echo State Networks
The recurrent weights mapping from h(t−1) to h(t) and the input weights mapping from x(t) to h(t) are some of the most difficult parameters to learn in a recurrent network. One proposed (Jaeger, 2003; Maass et al., 2002; Jaeger and Haas, 2004; Jaeger, 2007b) approach to avoiding this difficulty is to set the recurrent weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and learn only the output weights. This is the idea that was independently proposed for echo state networks or ESNs (Jaeger and Haas, 2004; Jaeger, 2007b) and liquid state machines (Maass et al., 2002). The latter is similar, except that it uses spiking neurons (with binary outputs) instead of the continuous-valued hidden units used for ESNs. Both ESNs and liquid state machines are termed reservoir computing (Lukoševičius and Jaeger, 2009) to denote the fact
that the hidden units form a reservoir of temporal features which may capture different aspects of the history of inputs.

One way to think about these reservoir computing recurrent networks is that they are similar to kernel machines: they map an arbitrary length sequence (the history of inputs up to time t) into a fixed-length vector (the recurrent state h(t)), on which a linear predictor (typically a linear regression) can be applied to solve the problem of interest. The training criterion may then be easily designed to be convex as a function of the output weights. For example, if the output consists of linear regression from the hidden units to the output targets, and the training criterion is mean squared error, then it is convex and may be solved reliably with simple learning algorithms (Jaeger, 2003).
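A minimal sketch of this recipe follows; all sizes, the spectral radius of 0.9, and the next-step-prediction task are assumptions chosen for illustration. The input and recurrent weights are fixed at random, the reservoir is run over the sequence, and only a linear readout is fit, here by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, T = 100, 500

# fixed random weights: only the readout below is learned
W_in = rng.normal(0.0, 0.1, (n_h, 1))
W = rng.normal(size=(n_h, n_h))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))  # rescale to spectral radius 0.9

x = np.sin(0.2 * np.arange(T))[:, None]  # toy input sequence
target = np.roll(x[:, 0], -1)            # task: predict the next input value

# run the reservoir to collect hidden states
H = np.zeros((T, n_h))
h = np.zeros(n_h)
for t in range(T):
    h = np.tanh(W @ h + W_in @ x[t])
    H[t] = h

# convex readout: least-squares regression from states to targets
w_out, *_ = np.linalg.lstsq(H[:-1], target[:-1], rcond=None)
mse = np.mean((H[:-1] @ w_out - target[:-1]) ** 2)
```

Because only `w_out` is trained and the criterion is mean squared error, the learning problem solved here is exactly the convex one described in the paragraph above.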
The important question is therefore: how do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state? The answer proposed in the reservoir computing literature is to
view the recurrent net as a dynamical system, and set the input and recurrent weights such that the dynamical system is near the edge of stability.

The original idea was to make the eigenvalues of the Jacobian of the state-to-state transition function be close to 1. As explained in Sec. 8.2.5, an important characteristic of a recurrent network is the eigenvalue spectrum of the Jacobians J(t) = ∂s(t)/∂s(t−1). Of particular importance is the spectral radius of J(t), defined to be the maximum of the absolute values of its eigenvalues.

To understand the effect of the spectral radius, consider the simple case of back-propagation with a Jacobian matrix J that does not change with t. This case happens, for example, when the network is purely linear. Suppose that J has
Consider what ensafter as we g, and n vector g , then after one step of back-propagation, we will hav have e J happ propagate a gradien t vector backw ards through time. If we b egin with a gradient n steps we will hav havee J g. No Now w consider what happ happens ens if we instead back-propagate g , then J g,step, n vaector after one back-propagation, willafter haveone and w after perturb perturbed ed version of g.step If wof e begin with g + δv,we then e will J g. nNosteps, steps w consider what happ back-propagate + δhav v). eAfter g +ifδwe v ). instead ha have ve Jw(egwill we will hav have e J n(ens From this we can see g g + δ v a perturb ed v ersion of . If w e begin with , then after one step, ge+will δv that bac back-propagation k-propagation starting from g and bac back-propagation k-propagation starting from w J ( g + δ v n J ( g + δ v ha ve ) . After steps, w e will hav e ) . F rom this w e can see n div diverge erge by δJ v after n steps of bac back-propagation. k-propagation. If v is chosen to be a unit g + δv that bac k-propagation starting from andmultiplication back-propagation starting from simply eigen eigenvector vector of J with eigen eigenv value by the Jacobian λ, gthen δJ v afteratneac divergethe bydifference steps of bac k-propagation. If vofisbac chosen to be a unit scales each h step. The two executions back-propagation k-propagation are eigen vector of with eigen v alue , then multiplication by the Jacobian simply J λ n separated by a distance of δ|λ| . When v corresp corresponds onds to the largest value of |λ| , scales the difference at eac h step. The t wo executions of an bacinitial k-propagation are this perturbation achiev achieves es the widest possible separation of perturbation separated of size δ . by a distance of δ λ . 
When v corresponds to the largest value of |λ|, this perturbation achieves the widest possible separation of an initial perturbation of size δ.

When |λ| > 1, the deviation size δ|λ|^n grows exponentially large. When |λ| < 1, the deviation size becomes exponentially small.

Of course, this example assumed that the Jacobian was the same at every time step, corresponding to a recurrent network with no nonlinearity. When a nonlinearity is present, the derivative of the nonlinearity will approach zero on many time steps, and help to prevent the explosion resulting from a large spectral radius. Indeed, the most recent work on echo state networks advocates using a spectral radius much larger than unity (Yildiz et al., 2012; Jaeger, 2012).

Everything we have said about back-propagation via repeated matrix multiplication applies equally to forward propagation in a network with no nonlinearity, where the state h(t+1)⊤ = h(t)⊤ W.

When a linear map W⊤ always shrinks h as measured by the L2 norm, then we say that the map is contractive. When the spectral radius is less than one, the mapping from h(t) to h(t+1) is contractive, so a small change becomes smaller after each time step. This necessarily makes the network forget information about the
past when we use a finite level of precision (such as 32 bit integers) to store the state vector.

The Jacobian matrix tells us how a small change of h(t) propagates one step forward, or equivalently, how the gradient on h(t+1) propagates one step backward, during back-propagation. Note that neither W nor J need to be symmetric (although they are square and real), so they can have complex-valued eigenvalues and eigenvectors, with imaginary components corresponding to potentially oscillatory behavior (if the same Jacobian was applied iteratively). Even though h(t) or a small variation of h(t) of interest in back-propagation are real-valued, they can be expressed in such a complex-valued basis. What matters is what happens to the magnitude (complex absolute value) of these possibly complex-valued basis coefficients, when we multiply the matrix by the vector. An eigenvalue with
eigenvalue with magnitude greater than one corresponds to magnification (exponential growth, if applied iteratively) or shrinking (exponential decay, if applied iteratively).

With a nonlinear map, the Jacobian is free to change at each step. The dynamics therefore become more complicated. However, it remains true that a small initial variation can turn into a large variation after several steps. One difference between the purely linear case and the nonlinear case is that the use of a squashing nonlinearity such as tanh can cause the recurrent dynamics to become bounded. Note that it is possible for back-propagation to retain unbounded dynamics even when forward propagation has bounded dynamics, for example, when a sequence of tanh units are all in the middle of their linear regime and are connected by weight matrices with spectral radius greater than 1.
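To make the linear case concrete, here is a small NumPy sketch (sizes and seed are illustrative) that rescales a random weight matrix to spectral radius 0.5 and iterates the purely linear map h^{(t+1)} = W h^{(t)}; the L2 norm of the state decays toward zero, so information about the initial state is eventually lost at any finite precision:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 4x4 recurrent weight matrix, rescaled so its spectral radius is 0.5.
W = rng.standard_normal((4, 4))
W *= 0.5 / max(abs(np.linalg.eigvals(W)))

h = rng.standard_normal(4)
norms = [np.linalg.norm(h)]
for _ in range(20):
    h = W @ h                      # purely linear map: h^(t+1) = W h^(t)
    norms.append(np.linalg.norm(h))

# Spectral radius < 1: the state norm decays, asymptotically like 0.5^t,
# so any finite-precision representation eventually forgets the initial state.
print(norms[0], norms[-1])
```

Note that for a non-normal W a single step need not shrink the norm (that is governed by the largest singular value), but a spectral radius below one guarantees the asymptotic decay shown here.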
However, it is rare for all of the tanh units to simultaneously lie at their linear activation point.

The strategy of echo state networks is simply to fix the weights to have some spectral radius such as 3, where information is carried forward through time but does not explode due to the stabilizing effect of saturating nonlinearities like tanh.

More recently, it has been shown that the techniques used to set the weights in ESNs could be used to initialize the weights in a fully trainable recurrent network (with the hidden-to-hidden recurrent weights trained using back-propagation through time), helping to learn long-term dependencies (Sutskever, 2012; Sutskever et al., 2013). In this setting, an initial spectral radius of 1.2 performs well, combined with the sparse initialization scheme described in Sec. 8.4.
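A sketch of this weight-scaling recipe (the helper name and the 10% density are illustrative, not from the text):

```python
import numpy as np

def scale_to_spectral_radius(W, rho):
    """Rescale a recurrent weight matrix so its spectral radius equals rho."""
    return W * (rho / max(abs(np.linalg.eigvals(W))))

rng = np.random.default_rng(1)

# Sparse random recurrent weights (roughly 10% nonzero), in the spirit of
# the sparse initialization scheme of Sec. 8.4.
W = rng.standard_normal((100, 100)) * (rng.random((100, 100)) < 0.1)

W_esn = scale_to_spectral_radius(W, 3.0)   # fixed weights, echo state style
W_init = scale_to_spectral_radius(W, 1.2)  # initial weights for a trainable RNN
```

In the ESN case W_esn would stay fixed and only the output weights would be trained; W_init would instead serve as the starting point for back-propagation through time.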
10.9 Leaky Units and Other Strategies for Multiple Time Scales

One way to deal with long-term dependencies is to design a model that operates at multiple time scales, so that some parts of the model operate at fine-grained time scales and can handle small details, while other parts operate at coarse time scales and transfer information from the distant past to the present more efficiently. Various strategies for building both fine and coarse time scales are possible. These include the addition of skip connections across time, "leaky units" that integrate signals with different time constants, and the removal of some of the connections used to model fine-grained time scales.

One way to obtain coarse time scales is to add direct connections from variables in the distant past to variables in the present.
The idea of using such skip connections dates back to Lin et al. (1996) and follows from the idea of incorporating delays in feedforward neural networks (Lang and Hinton, 1988). In an ordinary recurrent network, a recurrent connection goes from a unit at time t to a unit at time t + 1. It is possible to construct recurrent networks with longer delays (Bengio, 1991).

As we have seen in Sec. 8.2.5, gradients may vanish or explode exponentially with respect to the number of time steps. Lin et al. (1996) introduced recurrent connections with a time-delay of d to mitigate this problem. Gradients now diminish exponentially as a function of τ/d rather than τ. Since there are both delayed and single step connections, gradients may still explode exponentially in τ. This allows the learning algorithm to capture longer dependencies, although not all long-term dependencies may be represented well in this way.
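A minimal sketch of a recurrent state update with an added time-delayed (skip) connection; the matrices, sizes, and delay d are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, T = 8, 5, 30                       # state size, skip delay, sequence length

W1 = 0.1 * rng.standard_normal((n, n))   # ordinary connection from h^(t-1)
Wd = 0.1 * rng.standard_normal((n, n))   # skip connection from h^(t-d)
U = rng.standard_normal((n, 3))
x = rng.standard_normal((T, 3))

h = [np.zeros(n)]                        # h[t] holds the state at time t
for t in range(1, T + 1):
    h_skip = h[t - d] if t >= d else np.zeros(n)   # zero before h^(t-d) exists
    h.append(np.tanh(W1 @ h[t - 1] + Wd @ h_skip + U @ x[t - 1]))
```

Gradients can hop backward through the Wd edges in jumps of d steps, which is what makes them decay as a function of τ/d rather than τ.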
Another way to obtain paths on which the product of derivatives is close to one is to have units with self-connections and a weight near one on these connections.

When we accumulate a running average µ^{(t)} of some value v^{(t)} by applying the update µ^{(t)} ← αµ^{(t−1)} + (1 − α)v^{(t)}, the α parameter is an example of a linear self-connection from µ^{(t−1)} to µ^{(t)}. When α is near one, the running average remembers information about the past for a long time, and when α is near zero, information about the past is rapidly discarded. Hidden units with linear self-connections can behave similarly to such running averages. Such hidden units are called leaky units.
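The running-average behavior of a leaky unit can be seen directly (the α values and the toy signal are illustrative):

```python
import numpy as np

def leaky_unit(alpha, v):
    """Apply mu^(t) <- alpha * mu^(t-1) + (1 - alpha) * v^(t) over a sequence."""
    mu, trace = 0.0, []
    for vt in v:
        mu = alpha * mu + (1 - alpha) * vt
        trace.append(mu)
    return np.array(trace)

v = np.concatenate([np.ones(50), np.zeros(50)])  # evidence present, then absent

slow = leaky_unit(0.99, v)   # alpha near one: remembers the past for a long time
fast = leaky_unit(0.10, v)   # alpha near zero: discards the past rapidly
```

After 50 steps without input, the slow unit still retains a sizable fraction of the accumulated evidence, while the fast unit has forgotten it almost entirely.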
Skip connections through d time steps are a way of ensuring that a unit can always learn to be influenced by a value from d time steps earlier. The use of a linear self-connection with a weight near one is a different way of ensuring that the unit can access values from the past. The linear self-connection approach allows this effect to be adapted more smoothly and flexibly by adjusting the real-valued α rather than by adjusting the integer-valued skip length.

These ideas were proposed by Mozer (1992) and by El Hihi and Bengio (1996). Leaky units were also found to be useful in the context of echo state networks (Jaeger et al., 2007).

There are two basic strategies for setting the time constants used by leaky units. One strategy is to manually fix them to values that remain constant, for example by sampling their values from some distribution once at initialization time. Another strategy is to make the time constants free parameters and learn them.
Having such leaky units at different time scales appears to help with long-term dependencies (Mozer, 1992; Pascanu et al., 2013a).

Another approach to handle long-term dependencies is the idea of organizing the state of the RNN at multiple time-scales (El Hihi and Bengio, 1996), with information flowing more easily through long distances at the slower time scales.

This idea differs from the skip connections through time discussed earlier because it involves actively removing length-one connections and replacing them with longer connections. Units modified in such a way are forced to operate on a long time scale. Skip connections through time add edges. Units receiving such new connections may learn to operate on a long time scale but may also choose to focus on their other short-term connections.

There are different ways in which a group of recurrent units can be forced to operate at different time scales.
One option is to make the recurrent units leaky, but to have different groups of units associated with different fixed time scales. This was the proposal in Mozer (1992) and has been successfully used in Pascanu et al. (2013a). Another option is to have explicit and discrete updates taking place at different times, with a different frequency for different groups of units. This is the approach of El Hihi and Bengio (1996) and Koutnik et al. (2014). It worked well on a number of benchmark datasets.
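A sketch of the second strategy: two groups of units where the slow group receives explicit discrete updates only every few steps (the period, weights, and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
T, n, period = 16, 4, 4
W_fast = 0.5 * np.eye(n)
W_slow = 0.5 * np.eye(n)
U = rng.standard_normal((n, n))
x = rng.standard_normal((T, n))

fast = np.zeros(n)
slow = np.zeros(n)
for t in range(T):
    fast = np.tanh(W_fast @ fast + U @ x[t])   # updated at every time step
    if t % period == 0:                        # slow group: coarser time scale
        slow = np.tanh(W_slow @ slow + U @ x[t])
```

Because the slow group changes state only once per period, gradients propagated through it cross the same time span in fewer multiplicative steps.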
10.10 The Long Short-Term Memory and Other Gated RNNs

As of this writing, the most effective sequence models used in practical applications are called gated RNNs. These include the long short-term memory and networks based on the gated recurrent unit.

Like leaky units, gated RNNs are based on the idea of creating paths through time that have derivatives that neither vanish nor explode. Leaky units did this with connection weights that were either manually chosen constants or were parameters. Gated RNNs generalize this to connection weights that may change at each time step.

Leaky units allow the network to accumulate information (such as evidence for a particular feature or category) over a long duration. However, once that information has been used, it might be useful for the neural network to forget the old state.
For example, if a sequence is made of sub-sequences and we want a leaky unit to accumulate evidence inside each sub-subsequence, we need a mechanism to forget the old state by setting it to zero. Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it. This is what gated RNNs do.

The clever idea of introducing self-loops to produce paths where the gradient can flow for long durations is a core contribution of the initial long short-term memory (LSTM) model (Hochreiter and Schmidhuber, 1997). A crucial addition has been to make the weight on this self-loop conditioned on the context, rather than fixed (Gers et al., 2000). By making the weight of this self-loop gated (controlled by another hidden unit), the time scale of integration can be changed dynamically. In this case, we mean that even for an LSTM with fixed parameters, the time scale of integration can change based on the input sequence, because the time constants
LSTM hasfixed beenparameters, found extremely successful integration can change such basedasonunconstrained the input sequence, becauserecognition the time constants in man many y applications, handwriting (Gra Grav ves are output by the mo del itself. The LSTM has b een found extremely successful et al. al.,, 2009), speech recognition (Gra Grav ves et al. al.,, 2013; Grav Graves es and Jaitly, 2014), in many applications, as unconstrained handwriting recognition (Gra ves), handwriting generation (such Gra , 2013 ), mac translation ( Sutsk Graves ves machine hine Sutskev ev ever er et al. al., , 2014 et al., captioning 2009), speech recognition (Gra;vVin es et al.et , 2013 ; Grav; es Jaitly , 2014 image (Kiros et al., 2014b Vinyals yals al. al.,, 2014b Xuand et al. al., , 2015 ) and), handwriting generation Graves parsing (Viny Vinyals als et al. al.,, (2014a ). , 2013), machine translation (Sutskever et al., 2014), image captioning (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015) and The (LSTM block corresponding onding parsing Vinyalsblo et ck al.,diagram 2014a). is illustrated in Fig. 10.16. The corresp forw propagation equations are giv b elow, in the case of a shallo recurrent forward ard given en shallow w The LSTM block diagram is illustrated in Fig. 10.16. The corresponding forward propagation equations are given 411b elow, in the case of a shallow recurrent
Figure 10.16: Block diagram of the LSTM recurrent network "cell." Cells are connected recurrently to each other, replacing the usual hidden units of ordinary recurrent networks. An input feature is computed with a regular artificial neuron unit. Its value can be accumulated into the state if the sigmoidal input gate allows it. The state unit has a linear self-loop whose weight is controlled by the forget gate. The output of the cell can be shut off by the output gate. All the gating units have a sigmoid nonlinearity, while the input unit can have any squashing nonlinearity. The state unit can also be used as an extra input to the gating units. The black square indicates a delay of 1 time unit.
network architecture. Deeper architectures have also been successfully used (Graves et al., 2013; Pascanu et al., 2014a). Instead of a unit that simply applies an element-wise nonlinearity to the affine transformation of inputs and recurrent units, LSTM recurrent networks have "LSTM cells" that have an internal recurrence (a self-loop), in addition to the outer recurrence of the RNN. Each cell has the same inputs and outputs as an ordinary recurrent network, but has more parameters and a system of gating units that controls the flow of information. The most important component is the state unit s_i^{(t)} that has a linear self-loop similar to the leaky units described in the previous section.
However, here, the self-loop weight (or the associated time constant) is controlled by a forget gate unit f_i^{(t)} (for time step t and cell i), that sets this weight to a value between 0 and 1 via a sigmoid unit:

    f_i^{(t)} = \sigma\Big( b_i^f + \sum_j U_{i,j}^f x_j^{(t)} + \sum_j W_{i,j}^f h_j^{(t-1)} \Big),    (10.33)

where x^{(t)} is the current input vector and h^{(t)} is the current hidden layer vector, containing the outputs of all the LSTM cells, and b^f, U^f, W^f are respectively biases, input weights and recurrent weights for the forget gates. The LSTM cell internal state is thus updated as follows, but with a conditional self-loop weight f_i^{(t)}:

    s_i^{(t)} = f_i^{(t)} s_i^{(t-1)} + g_i^{(t)} \sigma\Big( b_i + \sum_j U_{i,j} x_j^{(t)} + \sum_j W_{i,j} h_j^{(t-1)} \Big),    (10.34)

where b, U and W respectively denote the biases, input weights and recurrent weights into the LSTM cell. The external input gate unit g_i^{(t)} is computed similarly to the forget gate (with a sigmoid unit to obtain a gating value between 0 and 1), but with its own parameters:

    g_i^{(t)} = \sigma\Big( b_i^g + \sum_j U_{i,j}^g x_j^{(t)} + \sum_j W_{i,j}^g h_j^{(t-1)} \Big).    (10.35)

The output h_i^{(t)} of the LSTM cell can also be shut off, via the output gate q_i^{(t)}, which also uses a sigmoid unit for gating:

    h_i^{(t)} = \tanh\big( s_i^{(t)} \big) q_i^{(t)}    (10.36)

    q_i^{(t)} = \sigma\Big( b_i^o + \sum_j U_{i,j}^o x_j^{(t)} + \sum_j W_{i,j}^o h_j^{(t-1)} \Big)    (10.37)
which has parameters b^o, U^o, W^o for its biases, input weights and recurrent weights, respectively. Among the variants, one can choose to use the cell state s_i^{(t)} as an extra input (with its weight) into the three gates of the i-th unit, as shown in Fig. 10.16. This would require three additional parameters.

LSTM networks have been shown to learn long-term dependencies more easily than the simple recurrent architectures, first on artificial data sets designed for testing the ability to learn long-term dependencies (Bengio et al., 1994; Hochreiter and Schmidhuber, 1997; Hochreiter et al., 2000), then on challenging sequence processing tasks where state-of-the-art performance was obtained (Graves, 2012; Graves et al., 2013; Sutskever et al., 2014). Variants and alternatives to the LSTM have been studied and used and are discussed next.

Which pieces of the LSTM architecture are actually necessary?
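As a concrete reference point, a minimal NumPy sketch of one forward step of the LSTM cell defined by Eqs. 10.33-10.37 (parameter layout and names are illustrative; following the equations, the cell input uses a sigmoid, although any squashing nonlinearity could be substituted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, p):
    """One LSTM cell step. p maps a name to its (b, U, W) parameter triple."""
    def affine(name):
        b, U, W = p[name]
        return b + U @ x + W @ h_prev

    f = sigmoid(affine("f"))      # forget gate, Eq. 10.33
    g = sigmoid(affine("g"))      # external input gate, Eq. 10.35
    q = sigmoid(affine("o"))      # output gate, Eq. 10.37
    s = f * s_prev + g * sigmoid(affine("cell"))  # state update, Eq. 10.34
    h = np.tanh(s) * q            # cell output, Eq. 10.36
    return h, s

rng = np.random.default_rng(3)
n_in, n_h = 3, 4
p = {name: (rng.standard_normal(n_h),
            rng.standard_normal((n_h, n_in)),
            rng.standard_normal((n_h, n_h)))
     for name in ("f", "g", "o", "cell")}

h = s = np.zeros(n_h)
for x in rng.standard_normal((10, n_in)):
    h, s = lstm_step(x, h, s, p)
```

The linear term f * s_prev is the gated self-loop: when f saturates near one, the state is carried forward almost unchanged, giving the long-duration gradient paths discussed above.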
What other successful architectures could be designed that allow the network to dynamically control the time scale and forgetting behavior of different units?

Some answers to these questions are given with the recent work on gated RNNs, whose units are also known as gated recurrent units or GRUs (Cho et al., 2014b; Chung et al., 2014, 2015a; Jozefowicz et al., 2015; Chrupala et al., 2015). The main difference with the LSTM is that a single gating unit simultaneously controls the forgetting factor and the decision to update the state unit. The update equations are the following:

    h_i^{(t)} = u_i^{(t-1)} h_i^{(t-1)} + \big(1 - u_i^{(t-1)}\big) \sigma\Big( b_i + \sum_j U_{i,j} x_j^{(t-1)} + \sum_j W_{i,j} r_j^{(t-1)} h_j^{(t-1)} \Big),    (10.38)

where u stands for the "update" gate and r for the "reset" gate. Their value is defined as usual:

    u_i^{(t)} = \sigma\Big( b_i^u + \sum_j U_{i,j}^u x_j^{(t)} + \sum_j W_{i,j}^u h_j^{(t)} \Big)    (10.39)

and

    r_i^{(t)} = \sigma\Big( b_i^r + \sum_j U_{i,j}^r x_j^{(t)} + \sum_j W_{i,j}^r h_j^{(t)} \Big).    (10.40)

The reset and update gates can individually "ignore" parts of the state vector. The update gates act like conditional leaky integrators that can linearly gate any
dimension, thus choosing to copy it (at one extreme of the sigmoid) or completely ignore it (at the other extreme) by replacing it by the new "target state" value (towards which the leaky integrator wants to converge). The reset gates control which parts of the state get used to compute the next target state, introducing an additional nonlinear effect in the relationship between past state and future state.

Many more variants around this theme can be designed. For example the reset gate (or forget gate) output could be shared across multiple hidden units. Alternately, the product of a global gate (covering a whole group of units, such as an entire layer) and a local gate (per unit) could be used to combine global control and local control. However, several investigations over architectural variations of the LSTM and GRU found no variant that would clearly beat both of these across a wide range of tasks (Greff et al., 2015; Jozefowicz et al., 2015).
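The GRU update of Eqs. 10.38-10.40 can be sketched similarly (parameter names are illustrative; as in Eq. 10.38, the candidate "target state" here uses a sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, p):
    """One GRU step. p maps a name to its (b, U, W) parameter triple."""
    u = sigmoid(p["u"][0] + p["u"][1] @ x + p["u"][2] @ h_prev)  # update gate
    r = sigmoid(p["r"][0] + p["r"][1] @ x + p["r"][2] @ h_prev)  # reset gate
    # Reset gate chooses which parts of the state feed the target state.
    target = sigmoid(p["h"][0] + p["h"][1] @ x + p["h"][2] @ (r * h_prev))
    # Update gate acts as a conditional leaky integrator, per dimension.
    return u * h_prev + (1 - u) * target

rng = np.random.default_rng(6)
n_in, n_h = 3, 4
p = {name: (rng.standard_normal(n_h),
            rng.standard_normal((n_h, n_in)),
            rng.standard_normal((n_h, n_h)))
     for name in ("u", "r", "h")}

h = np.zeros(n_h)
for x in rng.standard_normal((10, n_in)):
    h = gru_step(x, h, p)
```

Per dimension, u near one copies the old state and u near zero replaces it with the new target, which is exactly the copy-versus-ignore behavior described above.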
Greff et al. (2015) found that a crucial ingredient is the forget gate, while Jozefowicz et al. (2015) found that adding a bias of 1 to the LSTM forget gate, a practice advocated by Gers et al. (2000), makes the LSTM as strong as the best of the explored architectural variants.
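Numerically, the bias of 1 shifts the forget gate's initial operating point from about 0.5 to about 0.73, so the state decays much more slowly before any learning takes place (a small illustrative check, assuming near-zero pre-activations apart from the bias):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forget gate value at initialization, with only the bias contributing.
f_zero_bias = sigmoid(0.0)   # ~0.5: the state roughly halves at every step
f_unit_bias = sigmoid(1.0)   # ~0.73: the state persists noticeably longer

# Fraction of the state remaining after 10 steps under each gate value.
remaining_zero = f_zero_bias ** 10
remaining_unit = f_unit_bias ** 10
```

With the unit bias, an order of magnitude more of the state survives 10 steps, which is why this initialization helps gradients reach long-range dependencies early in training.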
10.11
Optimization for Long-Term Dependencies
Sec. 8.2.5 and Sec. 10.7 have described the vanishing and exploding gradient problems that occur when optimizing RNNs over many time steps.

An interesting idea proposed by Martens and Sutskever (2011) is that second derivatives may vanish at the same time that first derivatives vanish. Second-order optimization algorithms may roughly be understood as dividing the first derivative by the second derivative (in higher dimension, multiplying the gradient by the inverse Hessian). If the second derivative shrinks at a similar rate to the first derivative, then the ratio of first and second derivatives may remain relatively constant. Unfortunately, second-order methods have many drawbacks, including high computational cost, the need for a large minibatch, and a tendency to be attracted to saddle points. Martens and Sutskever (2011) found promising results using second-order methods. Later, Sutskever et al. (2013) found that simpler methods such as Nesterov momentum with careful initialization could achieve similar results. See Sutskever (2012) for more detail. Both of these approaches have largely been replaced by simply using SGD (even without momentum) applied to LSTMs. This is part of a continuing theme in machine learning that it is often much easier to design a model that is easy to optimize than it is to design a more powerful optimization algorithm.
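For reference, the Nesterov momentum update mentioned above can be sketched as follows (a toy illustration on a quadratic bowl; the learning rate and momentum values are arbitrary assumptions, and this is not the careful initialization scheme of Sutskever et al.):

```python
import numpy as np

def nesterov_step(theta, velocity, grad_fn, lr=0.01, momentum=0.9):
    """One step of Nesterov momentum: evaluate the gradient at the
    look-ahead point theta + momentum * velocity, then update."""
    lookahead = theta + momentum * velocity
    velocity = momentum * velocity - lr * grad_fn(lookahead)
    return theta + velocity, velocity

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
velocity = np.zeros_like(theta)
for _ in range(200):
    theta, velocity = nesterov_step(theta, velocity, grad_fn=lambda t: t)
# theta converges toward the minimum at the origin.
```

The only difference from standard momentum is where the gradient is evaluated: at the look-ahead point rather than at the current parameters.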
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
As discussed in Sec. 8.2.4, strongly nonlinear functions such as those computed by a recurrent net over many time steps tend to have derivatives that can be either very large or very small in magnitude. This is illustrated in Fig. 8.3 and Fig. 10.17, in which we see that the objective function (as a function of the parameters) has a "landscape" in which one finds "cliffs": wide and rather flat regions separated by tiny regions where the objective function changes quickly, forming a kind of cliff.

The difficulty that arises is that when the parameter gradient is very large, a gradient descent parameter update could throw the parameters very far, into a region where the objective function is larger, undoing much of the work that had been done to reach the current solution. The gradient tells us the direction that corresponds to the steepest descent within an infinitesimal region surrounding the current parameters. Outside of this infinitesimal region, the cost function may begin to curve back upwards. The update must be chosen to be small enough to avoid traversing too much upward curvature. We typically use learning rates that decay slowly enough that consecutive steps have approximately the same learning rate. A step size that is appropriate for a relatively linear part of the landscape is often inappropriate and causes uphill motion if we enter a more curved part of the landscape on the next step.
Figure 10.17: Example of the effect of gradient clipping in a recurrent network with two parameters w and b. Gradient clipping can make gradient descent perform more reasonably in the vicinity of extremely steep cliffs. These steep cliffs commonly occur in recurrent networks near where a recurrent network behaves approximately linearly. The cliff is exponentially steep in the number of time steps because the weight matrix is multiplied by itself once for each time step. Gradient descent without gradient clipping overshoots the bottom of this small ravine, then receives a very large gradient from the cliff face. The large gradient catastrophically propels the parameters outside the axes of the plot. Gradient descent with gradient clipping has a more moderate reaction to the cliff. While it does ascend the cliff face, the step size is restricted so that it cannot be propelled away from the steep region near the solution. Figure adapted with permission from Pascanu et al. (2013a).
A simple type of solution has been in use by practitioners for many years: clipping the gradient. There are different instances of this idea (Mikolov, 2012; Pascanu et al., 2013a). One option is to clip the parameter gradient from a minibatch element-wise (Mikolov, 2012) just before the parameter update. Another is to clip the norm ||g|| of the gradient g (Pascanu et al., 2013a) just before the parameter update:

    if ||g|| > v:                       (10.41)
        g ← g v / ||g||                 (10.42)

where v is the norm threshold and g is used to update parameters. Because the gradient of all the parameters (including different groups of parameters, such as weights and biases) is renormalized jointly with a single scaling factor, the latter method has the advantage that it guarantees that each step is still in the gradient direction, but experiments suggest that both forms work similarly.
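Both clipping rules can be sketched in a few lines of NumPy; this is an illustrative sketch, not code from the book:

```python
import numpy as np

def clip_grad_norm(g, v):
    """Norm clipping (Eq. 10.41-10.42): if ||g|| > v, rescale g to norm v.
    All parameters are clipped jointly, so the direction is preserved."""
    norm = np.linalg.norm(g)
    if norm > v:
        g = g * (v / norm)
    return g

def clip_grad_elementwise(g, v):
    """Element-wise clipping (Mikolov, 2012): clamp each component to
    [-v, v]. The direction may change, but the result is still a
    descent direction."""
    return np.clip(g, -v, v)

g = np.array([3.0, 4.0])                 # ||g|| = 5
g_norm = clip_grad_norm(g, v=1.0)        # same direction, norm rescaled to 1
g_elem = clip_grad_elementwise(g, v=1.0) # each component clamped to 1
```

Note that `g_norm` still points along (3, 4) while `g_elem` points along (1, 1): the two rules bound the update differently even on this tiny example.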
Although the parameter update has the same direction as the true gradient, with gradient norm clipping, the parameter update vector norm is now bounded. This bounded gradient avoids performing a detrimental step when the gradient explodes. In fact, even simply taking a random step when the gradient magnitude is above a threshold tends to work almost as well. If the explosion is so severe that the gradient is numerically Inf or Nan (considered infinite or not-a-number), then a random step of size v can be taken and will typically move away from the numerically unstable configuration. Clipping the gradient norm per-minibatch will not change the direction of the gradient for an individual minibatch. However, taking the average of the norm-clipped gradient from many minibatches is not equivalent to clipping the norm of the true gradient (the gradient formed from using all examples). Examples that have large gradient norm, as well as examples that appear in the same minibatch as such examples, will have their contribution to the final direction diminished. This stands in contrast to traditional minibatch gradient descent, where the true gradient direction is equal to the average over all minibatch gradients. Put another way, traditional stochastic gradient descent uses an unbiased estimate of the gradient, while gradient descent with norm clipping introduces a heuristic bias that we know empirically to be useful. With element-wise clipping, the direction of the update is not aligned with the true gradient or the minibatch gradient, but it is still a descent direction. It has also been proposed (Graves, 2013) to clip the back-propagated gradient (with respect to hidden units) but no comparison has been published between these variants; we conjecture that all these methods behave similarly.

Gradient clipping helps to deal with exploding gradients, but it does not help with vanishing gradients. To address vanishing gradients and better capture long-term dependencies, we discussed the idea of creating paths in the computational graph of the unfolded recurrent architecture along which the product of gradients associated with arcs is near 1. One approach to achieve this is with LSTMs and other self-loops and gating mechanisms, described above in Sec. 10.10. Another idea is to regularize or constrain the parameters so as to encourage "information flow." In particular, we would like the gradient vector ∇_{h^(t)} L being back-propagated to maintain its magnitude, even if the loss function only penalizes the output at the end of the sequence. Formally, we want

    (∇_{h^(t)} L) (∂h^(t) / ∂h^(t−1))                    (10.43)
to be as large as

    ∇_{h^(t)} L.                                         (10.44)

With this objective, Pascanu et al. (2013a) propose the following regularizer:

    Ω = Σ_t ( ||(∇_{h^(t)} L) (∂h^(t)/∂h^(t−1))|| / ||∇_{h^(t)} L|| − 1 )²    (10.45)

Computing the gradient of this regularizer may appear difficult, but Pascanu et al. (2013a) propose an approximation in which we consider the back-propagated vectors ∇_{h^(t)} L as if they were constants (for the purpose of this regularizer, so that there is no need to back-propagate through them). The experiments with this regularizer suggest that, if combined with the norm clipping heuristic (which handles gradient explosion), the regularizer can considerably increase the span of the dependencies that an RNN can learn. Because it keeps the RNN dynamics on the edge of explosive gradients, the gradient clipping is particularly important. Without gradient clipping, gradient explosion prevents learning from succeeding. A key weakness of this approach is that it is not as effective as the LSTM for tasks where data is abundant, such as language modeling.
10.12
Explicit Memory
Intelligence requires knowledge, and acquiring knowledge can be done via learning, which has motivated the development of large-scale deep architectures. However, there are different kinds of knowledge. Some knowledge can be implicit, subconscious, and difficult to verbalize—such as how to walk, or how a dog looks different from a cat. Other knowledge can be explicit, declarative, and relatively straightforward to put into words—everyday commonsense knowledge, like "a cat is a kind of animal," or very specific facts that you need to know to accomplish your current goals, like "the meeting with the sales team is at 3:00 PM in room 141."

Neural networks excel at storing implicit knowledge. However, they struggle to memorize facts. Stochastic gradient descent requires many presentations of the same input before it can be stored in neural network parameters, and even then, that input will not be stored especially precisely. Graves et al. (2014b) hypothesized that this is because neural networks lack the equivalent of the working memory system that allows human beings to explicitly hold and manipulate pieces of information that are relevant to achieving some goal. Such explicit memory
Figure 10.18: A schematic of an example of a network with an explicit memory, capturing some of the key design elements of the neural Turing machine. In this diagram we distinguish the "representation" part of the model (the "task network," here a recurrent net in the bottom) from the "memory" part of the model (the set of cells), which can store facts. The task network learns to "control" the memory, deciding where to read from and where to write to within the memory (through the reading and writing mechanisms, indicated by bold arrows pointing at the reading and writing addresses).
components would allow our systems not only to rapidly and "intentionally" store and retrieve specific facts but also to sequentially reason with them. The need for neural networks that can process information in a sequence of steps, changing the way the input is fed into the network at each step, has long been recognized as important for the ability to reason rather than to make automatic, intuitive responses to the input (Hinton, 1990).

To resolve this difficulty, Weston et al. (2014) introduced memory networks that include a set of memory cells that can be accessed via an addressing mechanism. Memory networks originally required a supervision signal instructing them how to use their memory cells. Graves et al. (2014b) introduced the neural Turing machine, which is able to learn to read from and write arbitrary content to memory cells without explicit supervision about which actions to undertake, and allowed end-to-end training without this supervision signal, via the use of a content-based soft attention mechanism (see Bahdanau et al. (2015) and Sec. 12.4.5.1). This soft addressing mechanism has become standard with other related architectures emulating algorithmic mechanisms in a way that still allows gradient-based optimization (Sukhbaatar et al., 2015; Joulin and Mikolov, 2015; Kumar et al., 2015; Vinyals et al., 2015a; Grefenstette et al., 2015).

Each memory cell can be thought of as an extension of the memory cells in LSTMs and GRUs. The difference is that the network outputs an internal state that chooses which cell to read from or write to, just as memory accesses in a digital computer read from or write to a specific address.

It is difficult to optimize functions that produce exact, integer addresses. To alleviate this problem, NTMs actually read to or write from many memory cells simultaneously. To read, they take a weighted average of many cells. To write, they modify multiple cells by different amounts. The coefficients for these operations are chosen to be focused on a small number of cells, for example, by producing them via a softmax function. Using these weights with non-zero derivatives allows the functions controlling access to the memory to be optimized using gradient descent. The gradient on these coefficients indicates whether each of them should be increased or decreased, but the gradient will typically be large only for those memory addresses receiving a large coefficient.
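The soft read and write operations described above can be sketched as follows. This is an illustrative sketch with made-up shapes, using an erase/add decomposition for the write; it is not the exact NTM formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def soft_read(memory, weights):
    """Read a weighted average of all cells: r = sum_i w_i * M_i."""
    return weights @ memory

def soft_write(memory, weights, erase, add):
    """Modify every cell by an amount proportional to its weight:
    first erase part of each cell's content, then add new content."""
    memory = memory * (1.0 - np.outer(weights, erase))
    return memory + np.outer(weights, add)

n_cells, cell_size = 8, 5
memory = np.zeros((n_cells, cell_size))
# Addressing weights come from a softmax, so they are positive, sum to 1,
# and have non-zero derivatives, allowing training by gradient descent.
# A large logit on cell 0 focuses most of the weight there.
weights = softmax(np.array([5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
memory = soft_write(memory, weights, erase=np.zeros(cell_size),
                    add=np.ones(cell_size))
r = soft_read(memory, weights)
```

Because the write touches every cell a little, the read does not return exactly the added vector; the weights concentrate, but do not binarize, the access.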
There are tw two o reasons These memory cells are t ypically augmen ted to con tain a vector, rather to increase the size of the memory cell. One reason is that we ha have ve increasedthan the the single scalar stored by an LSTM or GRU memory cell. There are tw o reasons cost of accessing a memory cell. We pay the computational cost of pro producing ducing a to increase the size of the memory cell. One reason is that w e ha ve increased the co coefficien efficien efficientt for many cells, but we exp expect ect these co coefficients efficients to cluster around a small cost of accessing We pay the rather computational cost ofvalue, producing a n umber of cells. aBymemory readingcell. a vector value, than a scalar we can coefficient for many cells, but we exp ect these coefficients to cluster around a small number of cells. By reading a vector 421 value, rather than a scalar value, we can
offset some of this cost. Another reason to use vector-valued memory cells is that they allow for content-based addressing, where the weight used to read to or write from a cell is a function of that cell. Vector-valued cells allow us to retrieve a complete vector-valued memory if we are able to produce a pattern that matches some but not all of its elements. This is analogous to the way that people can recall the lyrics of a song based on a few words. We can think of a content-based read instruction as saying, "Retrieve the lyrics of the song that has the chorus 'We all live in a yellow submarine.'" Content-based addressing is more useful when we make the objects to be retrieved large—if every letter of the song was stored in a separate memory cell, we would not be able to find them this way. By comparison, location-based addressing is not allowed to refer to the content of the memory. We can think of a location-based read instruction as saying "Retrieve the lyrics of the song in slot 347." Location-based addressing can often be a perfectly sensible mechanism even when the memory cells are small.

If the content of a memory cell is copied (not forgotten) at most time steps, then the information it contains can be propagated forward in time and the gradients propagated backward in time without either vanishing or exploding.
RNNs cannot One reason thisfrom adv advan an antage may because information and Explicit memory seems to allo w mo dels to learn tasks that ordinary RNNs or LSTM gradien gradients ts can be propagated (forward in time or backw backwards ards in time, resp respectively) ectively) RNNs cannot learn. One reason for this adv an tage may b e b ecause information and for very long durations. gradients can be propagated (forward in time or backwards in time, respectively) As anlong alternative to back-propagation through weigh weighted ted av averages erages of memory for very durations. cells, we can interpret the memory addressing co coefficients efficients as probabilities and As an alternative to back-propagation through weigh ted).avOptimizing erages of memory sto stocchastically read just one cell (Zaremba and Sutskev Sutskever er, 2015 mo models dels cells, w e can interpret the memory addressing co efficients as probabilities and that mak makee discrete decisions requires sp specialized ecialized optimization algorithms, describ described ed stoSec. chastically just training one cell (these Zaremba and Sutskev er, 2015). that Optimizing models in 20.9.1.read So far, sto stochastic chastic architectures make discrete that makeremains discrete harder decisions requires specialized optimization algorithms, describ ed decisions than training deterministic algorithms that make soft in Sec. 20.9.1. So far, training these stochastic architectures that make discrete decisions. decisions remains harder than training deterministic algorithms that make soft Whether it is soft (allo (allowing wing back-propagation) or sto stochastic chastic and hard, the mechdecisions. 
anism for choosing an address is in its form iden identical tical to the attention me mechanism chanism Whether it is soft (allo wing back-propagation) or sto chastic and hard, the whic which h had been previously introduced in the context of machine translation (mechBahanism for choosing an address is in its form iden tical to the attention me chanism danau et al. al.,, 2015) and discussed in Sec. 12.4.5.1. The idea of atten attention tion mechanisms whic h had b een previously introduced in the context of machine translation (Bahfor neural netw networks orks was in intro tro troduced duced even earlier, in the context of handwriting danau et al.(,Grav 2015es ) and discussed 12.4.5.1 . The idea that of atten mechanisms generation Graves , 2013 ), with in anSec. attention mechanism wastion constrained to for neural networks was introduced even earlier, in the context of handwriting generation (Graves, 2013), with an attention mechanism that was constrained to 422
CHAPTER 10. SEQUENCE MODELING: RECURRENT AND RECURSIVE NETS
move only forward in time through the sequence. In the case of machine translation and memory networks, at each step, the focus of attention can move to a completely different place, compared to the previous step.

Recurrent neural networks provide a way to extend deep learning to sequential data. They are the last major tool in our deep learning toolbox. Our discussion now moves to how to choose and use these tools and how to apply them to real-world tasks.
Chapter 11
Practical Methodology

Successfully applying deep learning techniques requires more than just a good knowledge of what algorithms exist and the principles that explain how they work. A good machine learning practitioner also needs to know how to choose an algorithm for a particular application and how to monitor and respond to feedback obtained from experiments in order to improve a machine learning system. During day to day development of machine learning systems, practitioners need to decide whether to gather more data, increase or decrease model capacity, add or remove regularizing features, improve the optimization of a model, improve approximate inference in a model, or debug the software implementation of the model. All of these operations are at the very least time-consuming to try out, so it is important to be able to determine the right course of action rather than blindly guessing.

Most of this book is about different machine learning models, training algorithms, and objective functions. This may give the impression that the most important ingredient to being a machine learning expert is knowing a wide variety of machine learning techniques and being good at different kinds of math. In practice, one can usually do much better with a correct application of a commonplace algorithm than by sloppily applying an obscure algorithm. Correct application of an algorithm depends on mastering some fairly simple methodology. Many of the recommendations in this chapter are adapted from Ng (2015).

We recommend the following practical design process:

• Determine your goals—what error metric to use, and your target value for this error metric. These goals and error metrics should be driven by the problem that the application is intended to solve.

• Establish a working end-to-end pipeline as soon as possible, including the
CHAPTER 11. PRACTICAL METHODOLOGY
estimation of the appropriate performance metrics.

• Instrument the system well to determine bottlenecks in performance. Diagnose which components are performing worse than expected and whether it is due to overfitting, underfitting, or a defect in the data or software.

• Repeatedly make incremental changes such as gathering new data, adjusting hyperparameters, or changing algorithms, based on specific findings from your instrumentation.

As a running example, we will use the Street View address number transcription system (Goodfellow et al., 2014d). The purpose of this application is to add buildings to Google Maps. Street View cars photograph the buildings and record the GPS coordinates associated with each photograph. A convolutional network recognizes the address number in each photograph, allowing the Google Maps database to add that address in the correct location. The story of how this commercial application was developed gives an example of how to follow the design methodology we advocate.

We now describe each of the steps in this process.
11.1
Performance Metrics
Determining your goals, in terms of which error metric to use, is a necessary first step because your error metric will guide all of your future actions. You should also have an idea of what level of performance you desire.

Keep in mind that for most applications, it is impossible to achieve absolute zero error. The Bayes error defines the minimum error rate that you can hope to achieve, even if you have infinite training data and can recover the true probability distribution. This is because your input features may not contain complete information about the output variable, or because the system might be intrinsically stochastic. You will also be limited by having a finite amount of training data.

The amount of training data can be limited for a variety of reasons. When your goal is to build the best possible real-world product or service, you can typically collect more data but must determine the value of reducing error further and weigh this against the cost of collecting more data. Data collection can require time, money, or human suffering (for example, if your data collection process involves performing invasive medical tests). When your goal is to answer a scientific question about which algorithm performs better on a fixed benchmark, the benchmark
specification usually determines the training set and you are not allowed to collect more data.

How can one determine a reasonable level of performance to expect? Typically, in the academic setting, we have some estimate of the error rate that is attainable based on previously published benchmark results. In the real-world setting, we have some idea of the error rate that is necessary for an application to be safe, cost-effective, or appealing to consumers. Once you have determined your realistic desired error rate, your design decisions will be guided by reaching this error rate.

Another important consideration besides the target value of the performance metric is the choice of which metric to use. Several different performance metrics may be used to measure the effectiveness of a complete application that includes machine learning components. These performance metrics are usually different from the cost function used to train the model. As described in Sec. 5.1.2, it is common to measure the accuracy, or equivalently, the error rate, of a system. However, many applications require more advanced metrics.

Sometimes it is much more costly to make one kind of a mistake than another. For example, an e-mail spam detection system can make two kinds of mistakes: incorrectly classifying a legitimate message as spam, and incorrectly allowing a spam message to appear in the inbox. It is much worse to block a legitimate message than to allow a questionable message to pass through. Rather than measuring the error rate of a spam classifier, we may wish to measure some form of total cost, where the cost of blocking legitimate messages is higher than the cost of allowing spam messages.

Sometimes we wish to train a binary classifier that is intended to detect some rare event. For example, we might design a medical test for a rare disease. Suppose
Supp Suppose ose Sometimes we wish to train a binary classifier that is intended to detect some that only one in every million people has this disease. We can easily achiev achievee rare even t. F or example, we migh t design a medical test for a rare disease. Supp ose 99.9999% accuracy on the detection task, by simply hard-co hard-coding ding the classifier that only one in every million p eople has this disease. W e can easily achiev to alwa always ys rep report ort that the disease is absen absent. t. Clearly Clearly,, accuracy is a poor wa way y toe 99.9999% on the detection byOne simply the classifier characterizeaccuracy the p erformance of suc system. wa to solveding this problem is to such h a task, way y hard-co to alwa ys rep ort that the disease is absen t. Clearly , accuracy is a p o or wa y to instead measure pr preecision and recal alll. Precision is the fraction of detections rep reported orted characterize thethat p erformance of sucwhile h a system. One y to solve is to b y the mo model del were correct, recall is thewafraction of this trueproblem ev even en ents ts that instead measure pr e cision and r e c al l . Precision is the fraction of detections rep orted were detected. A detector that says no one has the disease would ac achiev hiev hievee perfect b y the mo del that were correct, while recall is the fraction of true eventswould that precision, but zero recall. A detector that sa says ys ev every ery eryone one has the disease were detected. detector says no onetohas disease would achievwho e perfect ac achiev hiev hieve e perfect A recall, but that precision equal thethe percentage of people ha have ve precision, but zero recall. A detector that sa ys ev ery one has the disease would the disease (0.0001% in our example of a disease that only one people in a million ac hiev perfectusing recall, but precision equalittois the percentage eople who, ha ve ha hav ve). 
eWhen precision and recall, common to plotofapPR curve with the diseaseon(0.0001% in and our example a disease one pgenerates eople in a amillion x-axis. that precision the y-axis recall onofthe The only classifier score have).is When recall, itoccurred. is common plot a PR curve, with that higher using if the precision even eventt to band e detected F For ortoexample, a feedforw feedforward ard precision on the y-axis and recall on the x-axis. The classifier generates a score that is higher if the event to be detected 426 o ccurred. For example, a feedforward
CHAPTER 11. PRACTICAL METHODOLOGY
network designed to detect a disease outputs ŷ = P(y = 1 | x), estimating the probability that a person whose medical results are described by features x has the disease. We choose to report a detection whenever this score exceeds some threshold. By varying the threshold, we can trade precision for recall. In many cases, we wish to summarize the performance of the classifier with a single number rather than a curve. To do so, we can convert precision p and recall r into an F-score given by

F = 2pr / (p + r).     (11.1)

Another option is to report the total area lying beneath the PR curve.

In some applications, it is possible for the machine learning system to refuse to make a decision. This is useful when the machine learning algorithm can estimate how confident it should be about a decision, especially if a wrong decision can be harmful and if a human operator is able to occasionally take over. The Street View transcription system provides an example of this situation. The task is to transcribe the address number from a photograph in order to associate the location where the photo was taken with the correct address in a map. Because the value of the map degrades considerably if the map is inaccurate, it is important to add an address only if the transcription is correct. If the machine learning system thinks that it is less likely than a human being to obtain the correct transcription, then the best course of action is to allow a human to transcribe the photo instead. Of course, the machine learning system is only useful if it is able to dramatically reduce the amount of photos that the human operators must process. A natural performance metric to use in this situation is coverage. Coverage is the fraction of examples for which the machine learning system is able to produce a response. It is possible to trade coverage for accuracy. One can always obtain 100% accuracy by refusing to process any example, but this reduces the coverage to 0%. For the Street View task, the goal for the project was to reach human-level transcription accuracy while maintaining 95% coverage. Human-level performance on this task is 98% accuracy.

Many other metrics are possible. We can, for example, measure click-through rates, collect user satisfaction surveys, and so on. Many specialized application areas have application-specific criteria as well.

What is important is to determine which performance metric to improve ahead of time, then concentrate on improving this metric. Without clearly defined goals, it can be difficult to tell whether changes to a machine learning system make progress or not.
11.2
Default Baseline Models
After choosing performance metrics and goals, the next step in any practical application is to establish a reasonable end-to-end system as soon as possible. In this section, we provide recommendations for which algorithms to use as the first baseline approach in various situations. Keep in mind that deep learning research progresses quickly, so better default algorithms are likely to become available soon after this writing.

Depending on the complexity of your problem, you may even want to begin without using deep learning. If your problem has a chance of being solved by just choosing a few linear weights correctly, you may want to begin with a simple statistical model like logistic regression.

If you know that your problem falls into an "AI-complete" category like object recognition, speech recognition, machine translation, and so on, then you are likely to do well by beginning with an appropriate deep learning model.
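To make the simplest kind of baseline concrete, a logistic regression model of the sort mentioned above takes only a few lines to fit. This is a minimal NumPy sketch on synthetic data; the learning rate, iteration count, and data are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic two-class problem: the label depends only on the first feature.
X = rng.randn(200, 2)
y = (X[:, 0] > 0).astype(float)

# Logistic regression fit by batch gradient descent on the log-loss.
w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
    w -= lr * (X.T @ (p - y)) / len(y)       # gradient of the mean log-loss w.r.t. w
    b -= lr * np.mean(p - y)

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
train_accuracy = np.mean((p >= 0.5) == (y == 1))
```

In practice one would of course use a library implementation and judge the model on held-out data, but a baseline this cheap is often worth running before reaching for a deep model.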
First, choose the general category of model based on the structure of your data. If you want to perform supervised learning with fixed-size vectors as input, use a feedforward network with fully connected layers. If the input has known topological structure (for example, if the input is an image), use a convolutional network. In these cases, you should begin by using some kind of piecewise linear unit (ReLUs or their generalizations like Leaky ReLUs, PReLUs, and maxout). If your input or output is a sequence, use a gated recurrent net (LSTM or GRU).

A reasonable choice of optimization algorithm is SGD with momentum with a
decaying learning rate (popular decay schemes that perform better or worse on different problems include decaying linearly until reaching a fixed minimum learning rate, decaying exponentially, or decreasing the learning rate by a factor of 2-10 each time validation error plateaus). Another very reasonable alternative is Adam.

Batch normalization can have a dramatic effect on optimization performance, especially for convolutional networks and networks with sigmoidal nonlinearities. While it is reasonable to omit batch normalization from the very first baseline, it should be introduced quickly if optimization appears to be problematic.

Unless your training set contains tens of millions of examples or more, you should include some mild forms of regularization from the start. Early stopping should be used almost universally. Dropout is an excellent regularizer that is easy to implement and compatible with many models and training algorithms. Batch normalization also sometimes reduces generalization error and allows dropout to be omitted, due to the noise in the estimate of the statistics used to normalize each variable.
If your task is similar to another task that has been studied extensively extensively,, you will probably do well by first copying the mo model del and algorithm that is already If your task is bsimilar to another taskstudied that has been extensively youy kno known wn to perform est on the previously task. Youstudied may even wan wantt to ,cop copy probably do from well that by first copying the model algorithm that already awill trained mo model del task. For example, it isand common to use theis features kno wn to p erform b est on the previously studied task. Y ou may even wan t copy from a conv convolutional olutional netw network ork trained on ImageNet to solve other computertovision a trained mo deletfrom that).task. For example, it is common to use the features tasks (Girshick al., 2015 from a convolutional network trained on ImageNet to solve other computer vision A common question is whether to begin by using unsup unsupervised ervised learning, detasks (Girshick et al., 2015). scrib scribed ed further in Part III. This is somewhat domain sp specific. ecific. Some domains, suc such h A common question is whether to b egin by using unsup ervised learning, deas natural language pro processing, cessing, are known to benefit tremendously from unsup unsupererscribed furthertechniques in Part IIIsuch . Thisasislearning somewhat domain spword ecific.embeddings. Some domains, such vised learning unsupervised In other as naturalsuch language processing, are known benefit tremendously from unsupdo erdomains, as computer vision, currenttounsup unsupervised ervised learning techniques visedbring learning techniques such learning unsupervised word embeddings. Inber other not a benefit, except inasthe semi-sup semi-supervised ervised setting, when the num number of domains, such as computer vision, current unsup ervised learning techniques do lab labeled eled examples is very small (Kingma et al., 2014; Rasm Rasmus us et al. al.,, 2015). 
If your application is in a context where unsupervised learning is known to be important, then include it in your first end-to-end baseline. Otherwise, only use unsupervised learning in your first attempt if the task you want to solve is unsupervised. You can always try adding unsupervised learning later if you observe that your initial baseline overfits.
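The feature-reuse strategy mentioned above can be sketched in a framework-agnostic way. The "pretrained" extractor below is a toy stand-in (a fixed random ReLU projection), not an actual ImageNet network; only the new linear head is fit on the new task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained feature extractor: a fixed, frozen random
# projection with a ReLU nonlinearity. In a real application this would be,
# e.g., the convolutional layers of a network trained on ImageNet.
W_frozen = rng.normal(size=(20, 64))

def features(x):
    """Frozen 'pretrained' features; never updated on the new task."""
    return np.maximum(0.0, x @ W_frozen)

# Toy data for the new task.
X = rng.normal(size=(200, 20))
y = np.sign(X[:, 0] + X[:, 1])          # labels in {-1, +1}

# Train only a new linear head on top of the frozen features
# (a ridge-regression classifier, for simplicity).
F = features(X)
alpha = 1e-2
w = np.linalg.solve(F.T @ F + alpha * np.eye(F.shape[1]), F.T @ y)

acc = np.mean(np.sign(F @ w) == y)
print(f"train accuracy of linear head on frozen features: {acc:.2f}")
```

The point of the sketch is the division of labor: the extractor's parameters are never touched, so only a small, cheap model must be fit on the new task.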
11.3 Determining Whether to Gather More Data
After first end-to-end system is established, it is timeMore to measure the perfor11.3 theDetermining Whether to Gather Data mance of the algorithm and determine how to impro improv ve it. Man Many y machine learning After end-to-end system isemen established, it isout time to measure the perforno novices vicesthe arefirst tempted to mak make e improv improvemen ements ts by trying many different algorithms. mance of the and determine howmore to impro e it. Man y machine Ho How wev ever, er, it is algorithm often muc much h better to gather datavthan to improv improve e the learning learning no vices are tempted to mak e improv emen ts by trying out many different algorithms. algorithm. However, it is often much better to gather more data than to improve the learning Ho How w do does es one decide whether to gather more data? First, determine whether algorithm. the performance on the training set is acceptable. If performance on the training doesthe one decide algorithm whether toisgather morethe data? First,data determine set Ho is pwoor, learning not using training that is whether already the p erformance on the training set is acceptable. If p erformance on the training available, so there is no reason to gather more data. Instead, try increasing the set isofpthe oor,mo the algorithm training that is already size model dellearning by adding more la lay yisersnot or using addingthe more hiddendata units to each la lay yer. a v ailable, so there is no reason to gather more data. Instead, try increasing the Also, try improving the learning algorithm, for example by tuning the learning rate size of the model bIfy large adding more layers or adding moreoptimization hidden unitsalgorithms to each laydo er. h yp yperparameter. erparameter. mo models dels and carefully tuned Also, try improving the learning algorithm, for example b y tuning the learning rate not work well, then the problem migh mightt be the qualit quality y of the training data. The hyperparameter. 
If large models and carefully tuned optimization algorithms do not work well, then the problem might be the quality of the training data. The data may be too noisy or may not include the right inputs needed to predict the desired outputs. This suggests starting over, collecting cleaner data or collecting a richer set of features.

If the performance on the training set is acceptable, then measure the performance on a test set. If the performance on the test set is also acceptable, then there is nothing left to be done. If test set performance is much worse than training set performance, then gathering more data is one of the most effective solutions. The key considerations are the cost and feasibility of gathering more data, the cost and feasibility of reducing the test error by other means, and the amount of data that is expected to be necessary to improve test set performance significantly. At large internet companies with millions or billions of users, it is feasible to gather large datasets, and the expense of doing so can be considerably less than the other alternatives, so the answer is almost always to gather more training data. For example, the development of large labeled datasets was one of the most important factors in solving object recognition. In other contexts, such as medical applications, it may be costly or infeasible to gather more data.
A simple alternative to gathering more data is to reduce the size of the model or improve regularization, by adjusting hyperparameters such as weight decay coefficients, or by adding regularization strategies such as dropout. If you find that the gap between train and test performance is still unacceptable even after tuning the regularization hyperparameters, then gathering more data is advisable.

When deciding whether to gather more data, it is also necessary to decide how much to gather. It is helpful to plot curves showing the relationship between training set size and generalization error, like in Fig. 5.4. By extrapolating such curves, one can predict how much additional training data would be needed to achieve a certain level of performance.
Usually, adding a small fraction of the total number of examples will not have a noticeable impact on generalization error. It is therefore recommended to experiment with training set sizes on a logarithmic scale, for example doubling the number of examples between consecutive experiments.

If gathering much more data is not feasible, the only other way to improve generalization error is to improve the learning algorithm itself. This becomes the domain of research and not the domain of advice for applied practitioners.
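The curve-extrapolation idea can be sketched by measuring error at doubling training-set sizes and fitting a power law in log-log space; the power-law form is a common empirical choice, not something guaranteed to hold, and the error values below are hypothetical:

```python
import numpy as np

# Hypothetical measurements: generalization error at doubling dataset sizes.
sizes = np.array([1000, 2000, 4000, 8000, 16000], dtype=float)
errors = np.array([0.20, 0.16, 0.13, 0.105, 0.085])

# Fit error ~ a * n**b, i.e. a straight line in log-log space.
b, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
a = np.exp(log_a)

def predicted_error(n):
    """Extrapolated error at dataset size n (fitted exponent b is negative)."""
    return a * n ** b

# Invert the fit: how much data would it suggest for 5% error?
target = 0.05
n_needed = (target / a) ** (1.0 / b)
print(f"fitted exponent: {b:.3f}")
print(f"estimated examples for {target:.0%} error: {n_needed:,.0f}")
```

Such an estimate is only as good as the fitted trend, but it is often enough to tell whether the required dataset is ten times or a thousand times the current one.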
11.4 Selecting Hyperparameters
Most deep learning algorithms come with many hyperparameters that control many aspects of the algorithm's behavior. Some of these hyperparameters affect the time and memory cost of running the algorithm. Some of these hyperparameters affect the quality of the model recovered by the training process and its ability to infer correct results when deployed on new inputs.

There are two basic approaches to choosing these hyperparameters: choosing them manually and choosing them automatically. Choosing the hyperparameters manually requires understanding what the hyperparameters do and how machine learning models achieve good generalization. Automatic hyperparameter selection algorithms greatly reduce the need to understand these ideas, but they are often much more computationally costly.
11.4.1 Manual Hyperparameter Tuning
To set hyperparameters manually, one must understand the relationship between hyperparameters, training error, generalization error and computational resources (memory and runtime). This means establishing a solid foundation on the fundamental ideas concerning the effective capacity of a learning algorithm from Chapter 5.

The goal of manual hyperparameter search is usually to find the lowest generalization error subject to some runtime and memory budget. We do not discuss how to determine the runtime and memory impact of various hyperparameters here because this is highly platform-dependent.

The primary goal of manual hyperparameter search is to adjust the effective capacity of the model to match the complexity of the task.
Effective capacity abilit ability y of the learning algorithm to successfully minimize the cost function used to is constrained three representational capacit of the mo del, the train the mo model, del,byand the factors: degree tothe whic which h the cost function andy training pro procedure cedure ability of the to more successfully minimize the costunits function used to regularize the learning mo model. del. Aalgorithm mo model del with la lay yers and more hidden per la lay yer has train the mo del, and the degree to whic h the cost function and training pro cedure higher representational capacit capacity—it y—it is capable of representing more complicated regularize the mo del. A mo del with more laylearn ers and hidden units perthough, layer has functions. It can not necessarily actually allmore of these functions if higher representational capacity—it of representing the training algorithm cannot disco discov visercapable that certain functions more do a complicated go goo od job of functions. It not necessarily actually learnterms all of suc these though, if minimizing thecan training cost, or if regularization such h asfunctions weigh weightt deca decay y forbid the training some of thesealgorithm functions.cannot discover that certain functions do a good job of minimizing the training cost, or if regularization terms such as weight decay forbid The generalization error typically follows a U-shap U-shaped ed curve when plotted as some of these functions. a function of one of the hyp yperparameters, erparameters, as in Fig. 5.3. At one extreme, the The generalization error typically follows a yU-shap ed curve whenerror plotted as hyp yperparameter erparameter value corresp corresponds onds to low capacit capacity , and generalization is high a ecause function of one of the hyperparameters, as in Fig. 5.3. A t the one other extreme, the b training error is high. This is the underfitting regime. 
At extreme, h yp erparameter v alue corresp onds to low capacit y , and generalization error is high the hyperparameter value corresp corresponds onds to high capacity capacity,, and the generalization because training error the is high. This is the underfitting regime. other extreme, error is high because gap b et etw ween training and test errorAt is the high. Somewhere thethe hyperparameter alue corresp onds to high capacity , and the lo generalization in middle lies thevoptimal mo model del capacit capacity y, which achiev achieves es the low west possible error is high b ecause the gap b et w een training and test error is high. Somewhere generalization error, by adding a medium generalization gap to a medium amount in the middle lies the optimal mo del capacit y , which achiev es the lo w est p ossible of training error. generalization error, by adding a medium generalization gap to a medium amount For some hyp yperparameters, erparameters, overfitting occurs when the value of the hyperof training error. parameter is large. The num umb ber of hidden units in a lay layer er is one such example, For some hyperparameters, overfitting occurs when the value of the hyperparameter is large. The number of hidden 431 units in a layer is one such example,
CHAPTER 11. PRACTICAL METHODOLOGY
because increasing the num number ber of hidden units increases the capacit capacity y of the mo model. del. For some hyperparameters, ov overfitting erfitting occurs when the value of the hyp yperparameerparamebecause increasing the numthe ber smallest of hiddenallow unitsable increases capacit y of the of mozero del. ter is small. For example, allowable weigh eighttthe decay coefficient For some hyperparameters, erfittingcapacit occursy when value algorithm. of the hyperparamecorresp corresponds onds to the greatest ov effective capacity of thethe learning ter is small. For example, the smallest allowable weight decay coefficient of zero Not every hyperparameter will be able to explore the entire U-shaped curv curve. e. corresponds to the greatest effective capacity of the learning algorithm. Man Many y hyperparameters are discrete, such as the num numb ber of units in a lay layer er or the Not every hyperparameter will bunit, e able the entire U-shaped e. num umb ber of linear pieces in a maxout so to it isexplore only possible to visit a few pcurv oin oints ts Many the hyperparameters areerparameters discrete, such the num ber of these units hinyperparameters a layer or the along curve. Some hyp yperparameters areasbinary binary. . Usually n um b er of linear pieces in a maxout unit, so it is only p ossible to visit a few poinof ts are switc switches hes that specify whether or not to use some optional comp component onent along the curve. Some hyp erparameters arecessing binary.step Usually hyperparameters the learning algorithm, such as a prepro preprocessing thatthese normalizes the input are switc hes that specify whether or not to use some optional comp onent of features by subtracting their mean and dividing by their standard deviation. These the learning algorithm, as atwprepro cessing that normalizes the input h yp yperparameters erparameters can onlysuch explore o points on thestep curv curve. e. 
Other hyp yperparameters erparameters features by subtracting their mean and dividing by their standard deviation. ha hav ve some minim minimum um or maximum value that preven prevents ts them from exploringThese some hyperparameters explore o points on thetcurv e.y Other hyperparameters part of the curv curve. e. can Foronly example, thetwminimum weigh weight deca decay coefficient is zero. This have some or is maximum value thatweigh preven ts them fromwexploring means that minim if the um model underfitting when weight t decay is zero, e can not some en enter ter part of the curv e. F or example, the minimum weigh t deca y coefficient is zero. This the ov overfitting erfitting region by mo modifying difying the weigh weightt deca decay y co coefficient. efficient. In other words, means that if the model is underfitting when weigh t decay is zero, we can not enter some hyperparameters can only subtract capacit capacity y. the overfitting region by modifying the weight decay coefficient. In other words, The learning rate is perhaps the most imp important ortant hyperparameter. If you some hyperparameters can only subtract capacity. ha hav ve time to tune only one hyperparameter, tune the learning rate. It conis perhaps most ortant hyperparameter. you trolsThe thelearning effectiverate capacity of the the mo model del in aimp more complicated wa way y thanIfother ha v e time to tune only one hyperparameter, tune the learning rate. It conhyp yperparameters—the erparameters—the effective capacity of the mo model del is highest when the learning trols isthe effective of the moproblem, del in a not more complicated wayrate thanis other rate forcapacity the optimization when the learning esp espeecorrect h yp erparameters—the effective capacity of the mo del is highest when the learning cially large or esp especially ecially small. 
The learning rate has a U-shaped curve for training error, illustrated in Fig. 11.1. When the learning rate is too large, gradient descent can inadvertently increase rather than decrease the training error. In the idealized quadratic case, this occurs if the learning rate is at least twice as large as its optimal value (LeCun et al., 1998a). When the learning rate is too small, training is not only slower, but may become permanently stuck with a high training error. This effect is poorly understood (it would not happen for a convex loss function).

Tuning the parameters other than the learning rate requires monitoring both training and test error to diagnose whether your model is overfitting or underfitting, then adjusting its capacity appropriately.

If your error on the training set is higher than your target error rate, you have no choice but to increase capacity.
If you are not using regularization and you are confident that your optimization algorithm is performing correctly, then you must add more layers to your network or add more hidden units. Unfortunately, this increases the computational costs associated with the model.

If your error on the test set is higher than your target error rate, you can
[Figure 11.1 here: training error (vertical axis, 0 to 8) plotted against learning rate on a logarithmic scale (horizontal axis, 10^-2 to 10^0).]
Figure 11.1: Typical relationship between the learning rate and the training error. Notice the sharp rise in error when the learning rate is above an optimal value. This is for a fixed training time, as a smaller learning rate may sometimes only slow down training by a factor proportional to the learning rate reduction. Generalization error can follow this curve or be complicated by regularization effects arising out of having too large or too small learning rates, since poor optimization can, to some degree, reduce or prevent overfitting, and even points with equivalent training error can have different generalization error.
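The idealized quadratic case is easy to verify numerically: for the loss L(w) = lam * w**2 / 2, each gradient step multiplies w by (1 - lr * lam), so training diverges exactly when the learning rate exceeds twice the optimal value 1/lam. A minimal sketch:

```python
# Gradient descent on the 1-D quadratic loss L(w) = 0.5 * lam * w**2.
# The optimal fixed learning rate is 1/lam; divergence begins at 2/lam.
lam = 4.0

def final_loss(lr, steps=50, w0=1.0):
    """Loss after running gradient descent at a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w -= lr * lam * w          # gradient of L is lam * w
    return 0.5 * lam * w * w

for lr in [0.5 / lam, 1.0 / lam, 1.9 / lam, 2.5 / lam]:
    print(f"lr = {lr:.3f}: final loss = {final_loss(lr):.3g}")
```

At lr = 1/lam the optimum is reached in one step; at 1.9/lam training still converges, only oscillating; at 2.5/lam the error grows by a factor of 1.5 per step, reproducing the sharp rise on the right of the figure.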
now take two kinds of actions. The test error is the sum of the training error and the gap between training and test error. The optimal test error is found by trading off these quantities. Neural networks typically perform best when the training error is very low (and thus, when capacity is high) and the test error is primarily driven by the gap between train and test error. Your goal is to reduce this gap without increasing training error faster than the gap decreases. To reduce the gap, change regularization hyperparameters to reduce effective model capacity, such as by adding dropout or weight decay. Usually the best performance comes from a large model that is regularized well, for example by using dropout.

Most hyperparameters can be set by reasoning about whether they increase or decrease model capacity. Some examples are included in Table 11.1.

While manually tuning hyperparameters, do not lose sight of your end goal:
good performance on the test set. Adding regularization is only one way to achieve this goal. As long as you have low training error, you can always reduce generalization error by collecting more training data. The brute force way to practically guarantee success is to continually increase model capacity and training set size until the task is solved. This approach does of course increase the computational cost of training and inference, so it is only feasible given appropriate resources. In
Number of hidden units
  Increases capacity when: increased.
  Reason: Increasing the number of hidden units increases the representational capacity of the model.
  Caveats: Increasing the number of hidden units increases both the time and memory cost of essentially every operation on the model.

Learning rate
  Increases capacity when: tuned optimally.
  Reason: An improper learning rate, whether too high or too low, results in a model with low effective capacity due to optimization failure.

Convolution kernel width
  Increases capacity when: increased.
  Reason: Increasing the kernel width increases the number of parameters in the model.
  Caveats: A wider kernel results in a narrower output dimension, reducing model capacity unless you use implicit zero padding to reduce this effect. Wider kernels require more memory for parameter storage and increase runtime, but a narrower output reduces memory cost.

Implicit zero padding
  Increases capacity when: increased.
  Reason: Adding implicit zeros before convolution keeps the representation size large.
  Caveats: Increased time and memory cost of most operations.

Weight decay coefficient
  Increases capacity when: decreased.
  Reason: Decreasing the weight decay coefficient frees the model parameters to become larger.

Dropout rate
  Increases capacity when: decreased.
  Reason: Dropping units less often gives the units more opportunities to "conspire" with each other to fit the training set.

Table 11.1: The effect of various hyperparameters on model capacity.
principle, this approach could fail due to optimization difficulties, but for many problems optimization does not seem to be a significant barrier, provided that the model is chosen appropriately.
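Dropout, recommended above as the regularizer of choice for large models, can be sketched at the level of a single layer. This is the standard "inverted dropout" formulation, which rescales surviving activations at training time so that nothing needs to change at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p, train=True):
    """Inverted dropout: zero each unit with probability p during training,
    scale survivors by 1/(1-p) so the expected activation is unchanged."""
    if not train or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones((4, 1000))                              # toy activations
out = dropout(h, p=0.5)
print("mean activation with dropout:", out.mean())  # close to 1.0
print("mean activation at test time:", dropout(h, p=0.5, train=False).mean())
```

Because the expected activation is preserved, the same weights can be used at training and test time, which is what makes it cheap to regularize a very large model this way.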
11.4.2 Automatic Hyperparameter Optimization Algorithms

The ideal learning algorithm just takes a dataset and outputs a function, without requiring hand-tuning of hyperparameters. The popularity of several learning algorithms such as logistic regression and SVMs stems in part from their ability to perform well with only one or two tuned hyperparameters. Neural networks can sometimes perform well with only a small number of tuned hyperparameters, but often benefit significantly from tuning of forty or more hyperparameters. Manual hyperparameter tuning can work very well when the user has a good starting point, such as one determined by others having worked on the same type of application and architecture, or when the user has months or years of experience in exploring hyperparameter values for neural networks applied to similar tasks. However, for many applications, these starting points are not available. In these cases, automated algorithms can find useful values of the hyperparameters.
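As a sketch of such an automated algorithm, the loop below runs a simple random search, sampling hyperparameters on a logarithmic scale and keeping the best. Here `train_and_score` is a hypothetical stand-in for training a model and returning its validation error:

```python
import math
import random

random.seed(0)

def train_and_score(lr, weight_decay):
    """Hypothetical stand-in for 'train a model, return validation error'.
    For illustration it is a smooth surface minimized near lr = 1e-2 and
    weight_decay = 1e-4."""
    return (math.log10(lr) + 2) ** 2 + 0.1 * (math.log10(weight_decay) + 4) ** 2

# Random search: sample each hyperparameter log-uniformly, keep the best.
best = None
for _ in range(100):
    lr = 10 ** random.uniform(-5, 0)
    wd = 10 ** random.uniform(-7, -1)
    err = train_and_score(lr, wd)
    if best is None or err < best[0]:
        best = (err, lr, wd)

err, lr, wd = best
print(f"best validation error {err:.4f} at lr={lr:.2e}, weight_decay={wd:.2e}")
```

Note that the search itself has hyperparameters, the sampling ranges, which is exactly the point made in the next paragraph; they are usually much easier to choose than the values they govern.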
If we think about the way in which the user of a learning algorithm searches for good values of the hyperparameters, we realize that an optimization is taking place: we are trying to find a value of the hyperparameters that optimizes an objective function, such as validation error, sometimes under constraints (such as a budget for training time, memory or recognition time). It is therefore possible, in principle, to develop hyperparameter optimization algorithms that wrap a learning algorithm and choose its hyperparameters, thus hiding the hyperparameters of the learning algorithm from the user. Unfortunately, hyperparameter optimization algorithms often have their own hyperparameters, such as the range of values that should be explored for each of the learning algorithm's hyperparameters.
However, these secondary hyperparameters are usually easier to choose, in the sense that acceptable performance may be achieved on a wide range of tasks using the same secondary hyperparameters for all tasks.

11.4.3 Grid Search

When there are three or fewer hyperparameters, the common practice is to perform grid search. For each hyperparameter, the user selects a small finite set of values to explore. The grid search algorithm then trains a model for every joint specification of hyperparameter values in the Cartesian product of the set of values for each individual hyperparameter. The experiment that yields the best validation set
CHAPTER 11. PRACTICAL METHODOLOGY
[Figure 11.2: two panels, a grid layout (left) and a random layout (right), each plotting an important parameter on the horizontal axis against an unimportant parameter on the vertical axis.]

Figure 11.2: Comparison of grid search and random search. For illustration purposes we display two hyperparameters but we are typically interested in having many more. (Left) To perform grid search, we provide a set of values for each hyperparameter. The search algorithm runs training for every joint hyperparameter setting in the cross product of these sets. (Right) To perform random search, we provide a probability distribution over joint hyperparameter configurations. Usually most of these hyperparameters are independent from each other. Common choices for the distribution over a single hyperparameter include uniform and log-uniform (to sample from a log-uniform distribution, take the exp of a sample from a uniform distribution). The search algorithm then randomly samples joint hyperparameter configurations and runs training with each of them. Both grid search and random search evaluate the validation set error and return the best configuration. The figure illustrates the typical case where only some hyperparameters have a significant influence on the result. In this illustration, only the hyperparameter on the horizontal axis has a significant effect. Grid search wastes an amount of computation that is exponential in the number of non-influential hyperparameters, while random search tests a unique value of every influential hyperparameter on nearly every trial.
error is then chosen as having found the best hyperparameters. See the left of Fig. 11.2 for an illustration of a grid of hyperparameter values.

How should the lists of values to search over be chosen? In the case of numerical (ordered) hyperparameters, the smallest and largest element of each list is chosen conservatively, based on prior experience with similar experiments, to make sure that the optimal value is very likely to be in the selected range. Typically, a grid search involves picking values approximately on a logarithmic scale, e.g., a learning rate taken within the set {0.1, 0.01, 10^-3, 10^-4, 10^-5}, or a number of hidden units taken within the set {50, 100, 200, 500, 1000, 2000}.

Grid search usually performs best when it is performed repeatedly. For example, suppose that we ran a grid search over a hyperparameter α using values of {-1, 0, 1}. If the best value found is 1, then we underestimated the range in which the best α
lies and we should shift the grid and run another search with α in, for example, {1, 2, 3}. If we find that the best value of α is 0, then we may wish to refine our estimate by zooming in and running a grid search over {-0.1, 0, 0.1}.

The obvious problem with grid search is that its computational cost grows exponentially with the number of hyperparameters. If there are m hyperparameters, each taking at most n values, then the number of training and evaluation trials required grows as O(n^m). The trials may be run in parallel and exploit loose parallelism (with almost no need for communication between different machines carrying out the search). Unfortunately, due to the exponential cost of grid search, even parallelization may not provide a satisfactory size of search.
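As a concrete sketch of this procedure, grid search is a loop over a Cartesian product. Here `train_and_eval` is a hypothetical placeholder for training a model with the given hyperparameters and returning its validation set error:

```python
import itertools

def grid_search(train_and_eval, param_grid):
    """Train a model for every joint hyperparameter specification in the
    Cartesian product of the per-hyperparameter value lists, and return
    the configuration with the lowest validation set error."""
    names = sorted(param_grid)
    best_error, best_params = float("inf"), None
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        error = train_and_eval(**params)
        if error < best_error:
            best_error, best_params = error, params
    return best_params, best_error

# Values chosen approximately on a logarithmic scale, as suggested above.
param_grid = {
    "learning_rate": [0.1, 0.01, 1e-3, 1e-4, 1e-5],
    "n_hidden": [50, 100, 200, 500, 1000, 2000],
}
```

The number of trials is the product of the list lengths (here 5 × 6 = 30), which is what makes the cost of grid search exponential in the number of hyperparameters.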
11.4.4 Random Search

Fortunately, there is an alternative to grid search that is as simple to program, more convenient to use, and converges much faster to good values of the hyperparameters: random search (Bergstra and Bengio, 2012).

A random search proceeds as follows. First we define a marginal distribution for each hyperparameter, e.g., a Bernoulli or multinoulli for binary or discrete hyperparameters, or a uniform distribution on a log-scale for positive real-valued hyperparameters. For example,

    log_learning_rate ~ u(-1, -5)    (11.2)
    learning_rate = 10^log_learning_rate    (11.3)

where u(a, b) indicates a sample of the uniform distribution in the interval (a, b). Similarly, the log_number_of_hidden_units may be sampled from u(log(50), log(2000)).
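Eqs. 11.2 and 11.3 amount to sampling in log space and then exponentiating. A minimal sketch, using the example ranges above (the function name is illustrative, and the Bernoulli/multinoulli cases are omitted for brevity):

```python
import math
import random

def sample_configuration(rng):
    """Draw one joint hyperparameter configuration from independent
    marginal distributions defined on a log scale."""
    # Eq. 11.2: log_learning_rate ~ u(-1, -5)
    log_learning_rate = rng.uniform(-5, -1)
    # Eq. 11.3: learning_rate = 10 ** log_learning_rate
    learning_rate = 10.0 ** log_learning_rate
    # log_number_of_hidden_units ~ u(log(50), log(2000)); exponentiate
    # and round to get an integer number of hidden units.
    n_hidden = int(round(math.exp(rng.uniform(math.log(50), math.log(2000)))))
    return {"learning_rate": learning_rate, "n_hidden": n_hidden}

rng = random.Random(0)
config = sample_configuration(rng)
```

Sampling on a log scale means that, for example, learning rates near 10^-5 and near 10^-1 are equally likely, which is usually what we want for hyperparameters whose natural scale is multiplicative.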
Unlike in the case of a grid search, one should not discretize or bin the values of the hyperparameters. This allows one to explore a larger set of values, and does not incur additional computational cost. In fact, as illustrated in Fig. 11.2, a random search can be exponentially more efficient than a grid search, when there are several hyperparameters that do not strongly affect the performance measure. This is studied at length in Bergstra and Bengio (2012), who found that random search reduces the validation set error much faster than grid search, in terms of the number of trials run by each method.

As with grid search, one may often want to run repeated versions of random search, to refine the search based on the results of the first run.
The main reason why random search finds good solutions faster than grid search is that there are no wasted experimental runs, unlike in the case of grid search, when two values of a hyperparameter (given values of the other hyperparameters) would give the same result. In the case of grid search, the other hyperparameters would have the same values for these two runs, whereas with random search, they would usually have different values. Hence, if the change between these two values does not make much difference in terms of validation set error, grid search will unnecessarily repeat two equivalent experiments while random search will still give two independent explorations of the other hyperparameters.
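Putting this together, the whole random search procedure is a short loop. In the sketch below, a made-up one-dimensional "validation error" stands in for real training; note that every trial draws a fresh value of every hyperparameter, so no two trials repeat a value of an influential one:

```python
import random

def random_search(train_and_eval, sample_configuration, n_trials, seed=0):
    """Sample a fresh joint configuration for every trial, train with it,
    and return the configuration with the lowest validation set error."""
    rng = random.Random(seed)
    best_error, best_config = float("inf"), None
    for _ in range(n_trials):
        config = sample_configuration(rng)
        error = train_and_eval(**config)
        if error < best_error:
            best_error, best_config = error, config
    return best_config, best_error

# Toy illustration: a fictitious "validation error" minimized at
# learning_rate = 0.01, with the learning rate sampled log-uniformly.
sample = lambda rng: {"learning_rate": 10.0 ** rng.uniform(-5, -1)}
toy_error = lambda learning_rate: (learning_rate - 0.01) ** 2
best, err = random_search(toy_error, sample, n_trials=100)
```

With 100 trials, some sampled learning rate almost surely lands close to the optimum, even though no value is ever tried twice.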
11.4.5 Model-Based Hyperparameter Optimization

The search for good hyperparameters can be cast as an optimization problem. The decision variables are the hyperparameters. The cost to be optimized is the validation set error that results from training using these hyperparameters. In simplified settings where it is feasible to compute the gradient of some differentiable error measure on the validation set with respect to the hyperparameters, we can simply follow this gradient (Bengio et al., 1999; Bengio, 2000; Maclaurin et al., 2015). Unfortunately, in most practical settings, this gradient is unavailable, either due to its high computation and memory cost, or due to hyperparameters having intrinsically non-differentiable interactions with the validation set error, as in the case of discrete-valued hyperparameters.
To compensate for this lack of a gradient, we can build a model of the validation set error, then propose new hyperparameter guesses by performing optimization within this model. Most model-based algorithms for hyperparameter search use a Bayesian regression model to estimate both the expected value of the validation set error for each hyperparameter and the uncertainty around this expectation. Optimization thus involves a tradeoff between exploration (proposing hyperparameters
for which there is high uncertainty, which may lead to a large improvement but may also perform poorly) and exploitation (proposing hyperparameters which the model is confident will perform as well as any hyperparameters it has seen so far, usually hyperparameters that are very similar to ones it has seen before). Contemporary approaches to hyperparameter optimization include Spearmint (Snoek et al., 2012), TPE (Bergstra et al., 2011) and SMAC (Hutter et al., 2011).

Currently, we cannot unambiguously recommend Bayesian hyperparameter optimization as an established tool for achieving better deep learning results or for obtaining those results with less effort. Bayesian hyperparameter optimization sometimes performs comparably to human experts, sometimes better, but fails catastrophically on other problems.
It may be worth trying on a particular problem to see if it works, but it is not yet sufficiently mature or reliable. That being said, hyperparameter optimization is an important field of research that, while often driven primarily by the needs of deep learning, holds the potential to benefit not only the entire field of machine learning but the discipline of engineering in general.

One drawback common to most hyperparameter optimization algorithms with more sophistication than random search is that they require a training experiment to run to completion before they are able to extract any information from the experiment. This is much less efficient, in the sense of how much information can be gleaned early in an experiment, than manual search by a human practitioner, since one can usually tell early on if some set of hyperparameters is completely pathological. Swersky et al. (2014) have introduced an early version
of an algorithm that maintains a set of multiple experiments. At various time points, the hyperparameter optimization algorithm can choose to begin a new experiment, to "freeze" a running experiment that is not promising, or to "thaw" and resume an experiment that was earlier frozen but now appears promising given more information.
11.5 Debugging Strategies

When a machine learning system performs poorly, it is usually difficult to tell whether the poor performance is intrinsic to the algorithm itself or whether there is a bug in the implementation of the algorithm. Machine learning systems are difficult to debug for a variety of reasons.

In most cases, we do not know a priori what the intended behavior of the algorithm is. In fact, the entire point of using machine learning is that it will discover useful behavior that we were not able to specify ourselves. If we train a
neural network on a new classification task and it achieves 5% test error, we have no straightforward way of knowing if this is the expected behavior or sub-optimal behavior.

A further difficulty is that most machine learning models have multiple parts that are each adaptive. If one part is broken, the other parts can adapt and still achieve roughly acceptable performance. For example, suppose that we are training a neural net with several layers parametrized by weights W and biases b. Suppose further that we have manually implemented the gradient descent rule for each parameter separately, and that we made an error in the update for the biases:

    b ← b - α    (11.4)

where α is the learning rate. This erroneous update does not use the gradient at all. It causes the biases to constantly become negative throughout learning, which is clearly not a correct implementation of any reasonable learning algorithm.
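The erroneous rule in Eq. 11.4 is easy to observe in isolation: because the gradient never enters the update, the bias decreases by the same fixed amount at every step, regardless of the data. A minimal sketch:

```python
alpha = 0.1  # learning rate
b = 0.0      # bias parameter, initialized to zero

for step in range(100):
    # The buggy rule of Eq. 11.4: the gradient never appears, so the
    # update is identical at every step no matter what the data says.
    b = b - alpha
    # A correct rule would instead be: b = b - alpha * grad_loss_wrt_b

# b is now -10.0: it has drifted steadily negative throughout "training".
```

In a full network this drift can be partially masked, since, depending on the input distribution, the weights may adapt to compensate for the increasingly negative biases, which is why the bug may not be visible in the model's outputs.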
The bug may not be apparent just from examining the output of the model, though. Depending on the distribution of the input, the weights may be able to adapt to compensate for the negative biases.

Most debugging strategies for neural nets are designed to get around one or both of these two difficulties. Either we design a case that is so simple that the correct behavior actually can be predicted, or we design a test that exercises one part of the neural net implementation in isolation.

Some important debugging tests include:

Visualize the model in action: When training a model to detect objects in images, view some images with the detections proposed by the model displayed superimposed on the image. When training a generative model of speech, listen to some of the speech samples it produces. This may seem obvious, but it is easy to
fall into the practice of only looking at quantitative performance measurements like accuracy or log-likelihood. Directly observing the machine learning model performing its task will help you to determine whether the quantitative performance numbers it achieves seem reasonable. Evaluation bugs can be some of the most devastating bugs because they can mislead you into believing your system is performing well when it is not.

Visualize the worst mistakes: Most models are able to output some sort of confidence measure for the task they perform. For example, classifiers based on a softmax output layer assign a probability to each class. The probability assigned to the most likely class thus gives an estimate of the confidence the model has in its classification decision. Typically, maximum likelihood training results in these values being overestimates rather than accurate probabilities of correct prediction,
but they are somewhat useful in the sense that examples that are actually less likely to be correctly labeled receive smaller probabilities under the model. By viewing the training set examples that are the hardest to model correctly, one can often discover problems with the way the data has been preprocessed or labeled. For example, the Street View transcription system originally had a problem where the address number detection system would crop the image too tightly and omit some of the digits. The transcription network then assigned very low probability to the correct answer on these images. Sorting the images to identify the most confident mistakes showed that there was a systematic problem with the cropping.
performance of the overall system, even though the transcription network needed to be able to process greater variation in the position and scale of the address numbers.

Reasoning about software using train and test error: It is often difficult to determine whether the underlying software is correctly implemented. Some clues can be obtained from the train and test error. If training error is low but test error is high, then it is likely that the training procedure works correctly, and the model is overfitting for fundamental algorithmic reasons. An alternative possibility is that the test error is measured incorrectly due to a problem with saving the model after training then reloading it for test set evaluation, or if the test data was prepared differently from the training data. If both train and test error are high, then it is difficult to determine whether there is a software defect or whether
the model is underfitting due to fundamental algorithmic reasons. This scenario requires further tests, described next.

Fit a tiny dataset: If you have high error on the training set, determine whether it is due to genuine underfitting or due to a software defect. Usually even small models can be guaranteed to be able to fit a sufficiently small dataset. For example, a classification dataset with only one example can be fit just by setting the biases of the output layer correctly. Usually if you cannot train a classifier to correctly label a single example, an autoencoder to successfully reproduce a single example with high fidelity, or a generative model to consistently emit samples resembling a single example, there is a software defect preventing successful optimization on the training set. This test can be extended to a small dataset with few examples.
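A minimal sketch of this test, using a hypothetical bias-only softmax classifier in numpy (the number of classes, the learning rate, and the iteration count are illustrative choices, not from the text): gradient descent on a single labeled example should drive the training loss to nearly zero, and failure to do so suggests a defect.

```python
import numpy as np

# Fit one labeled example with a bias-only softmax classifier.
def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

n_classes, label = 4, 2
bias = np.zeros(n_classes)

# Gradient descent on the negative log-likelihood of the one example;
# the gradient of -log p[label] with respect to the logits is p - onehot.
for _ in range(500):
    p = softmax(bias)
    grad = p.copy()
    grad[label] -= 1.0
    bias -= 0.5 * grad

loss = -np.log(softmax(bias)[label])
assert loss < 0.05   # near-zero training loss; otherwise suspect a defect
```

If this assertion fails for a model that should trivially memorize one example, the optimization code, not the model capacity, is the likely culprit.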
Compare back-propagated derivatives to numerical derivatives: If you are using a software framework that requires you to implement your own gradient computations, or if you are adding a new operation to a differentiation library and must define its bprop method, then a common source of error is implementing this gradient expression incorrectly. One way to verify that these derivatives are correct
is to compare the derivatives computed by your implementation of automatic differentiation to the derivatives computed by finite differences. Because

    f'(x) = lim_{ε→0} (f(x + ε) − f(x)) / ε,                          (11.5)

we can approximate the derivative by using a small, finite ε:

    f'(x) ≈ (f(x + ε) − f(x)) / ε.                                    (11.6)

We can improve the accuracy of the approximation by using the centered difference:

    f'(x) ≈ (f(x + ½ε) − f(x − ½ε)) / ε.                              (11.7)

The perturbation size ε must be chosen to be large enough to ensure that the perturbation is not rounded down too much by finite-precision numerical computations.

Usually, we will want to test the gradient or Jacobian of a vector-valued function g : ℝ^m → ℝ^n. Unfortunately, finite differencing only allows us to take a single derivative at a time. We can either run finite differencing mn times to evaluate all of the partial derivatives of g, or we can apply the test to a new function that uses random projections at both the input and output of g.
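Eq. 11.6 and Eq. 11.7 translate directly into a debugging routine. A small sketch, with a hypothetical test function f standing in for the forward computation whose gradient implementation is under test:

```python
def centered_difference(f, x, eps=1e-5):
    """Approximate f'(x) via Eq. 11.7: (f(x + eps/2) - f(x - eps/2)) / eps."""
    return (f(x + eps / 2) - f(x - eps / 2)) / eps

# Hypothetical example: f(x) = x**3 has analytic derivative 3*x**2,
# standing in for a back-propagated gradient to be verified.
f = lambda x: x ** 3
df = lambda x: 3 * x ** 2

x = 0.7
assert abs(centered_difference(f, x) - df(x)) < 1e-8
```

The centered form has O(ε²) truncation error versus O(ε) for Eq. 11.6, which is why it tolerates a larger, rounding-safe ε.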
For example, we can apply our test of the implementation of the derivatives to f(x) where f(x) = u^T g(vx), where u and v are randomly chosen vectors. Computing f'(x) correctly requires being able to back-propagate through g correctly, yet it is efficient to do with finite differences because f has only a single input and a single output. It is usually a good idea to repeat this test for more than one value of u and v to reduce the chance that the test overlooks mistakes that are orthogonal to the random projection.

If one has access to numerical computation on complex numbers, then there is a very efficient way to numerically estimate the gradient by using complex numbers as input to the function (Squire and Trapp, 1998). The method is based on the observation that
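A sketch of this projection trick, with a hypothetical vector-valued g and its claimed Jacobian standing in for an implementation under test (d/dx of u^T g(vx) is u^T J_g(vx) v, which a single scalar finite difference can check):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical function g: R^2 -> R^3 and the Jacobian we wish to verify.
def g(x):
    return np.array([np.sin(x[0]) + x[1], x[0] * x[1], x[1] ** 2])

def jacobian_g(x):
    return np.array([[np.cos(x[0]), 1.0],
                     [x[1],         x[0]],
                     [0.0,          2 * x[1]]])

n, m = 3, 2
u = rng.standard_normal(n)     # random projection at the output
v = rng.standard_normal(m)     # random projection at the input

# f(x) = u^T g(v x) is scalar-in, scalar-out: one finite difference suffices.
f = lambda x: u @ g(v * x)
fprime_analytic = lambda x: u @ (jacobian_g(v * x) @ v)

eps, x0 = 1e-5, 0.3
approx = (f(x0 + eps / 2) - f(x0 - eps / 2)) / eps
assert abs(approx - fprime_analytic(x0)) < 1e-6
```

Re-running with fresh u and v, as the text advises, guards against errors that happen to lie in the null space of one particular projection.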
    f(x + iε) = f(x) + iεf'(x) + O(ε²),                               (11.8)
    real(f(x + iε)) = f(x) + O(ε²),   imag(f(x + iε)/ε) = f'(x) + O(ε²),   (11.9)

where i = √−1. Unlike in the real-valued case above, there is no cancellation effect due to taking the difference between the value of f at different points. This allows the use of tiny values of ε like ε = 10⁻¹⁵⁰, which make the O(ε²) error insignificant for all practical purposes.
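A minimal sketch of the complex-step estimate of Eq. 11.9 (the test function here is hypothetical; the method requires that f be implemented with operations that remain analytic when given complex inputs):

```python
import numpy as np

def complex_step_derivative(f, x, eps=1e-150):
    """Estimate f'(x) as imag(f(x + i*eps)) / eps (Squire and Trapp, 1998).

    No subtraction of nearby function values occurs, so there is no
    cancellation and eps can be made extremely small."""
    return np.imag(f(x + 1j * eps)) / eps

f = lambda x: np.exp(np.sin(x))           # f'(x) = cos(x) * exp(sin(x))
x = 1.3
exact = np.cos(x) * np.exp(np.sin(x))
estimate = complex_step_derivative(f, x)
assert abs(estimate - exact) < 1e-12
```

Code containing non-analytic operations (e.g. abs, comparisons used for branching on the real part) needs care before this trick applies.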
Monitor histograms of activations and gradients: It is often useful to visualize statistics of neural network activations and gradients, collected over a large number of training iterations (maybe one epoch). The pre-activation value of hidden units can tell us if the units saturate, or how often they do. For example, for rectifiers, how often are they off? Are there units that are always off? For tanh units, the average of the absolute value of the pre-activations tells us how saturated the unit is. In a deep network where the propagated gradients quickly grow or quickly vanish, optimization may be hampered. Finally, it is useful to compare the magnitude of parameter gradients to the magnitude of the parameters themselves. As suggested by Bottou (2015), we would like the magnitude of parameter updates over a minibatch to represent something like 1% of the magnitude of the parameter, not 50% or 0.001% (which would make the parameters move too slowly). It may be that some groups of parameters are moving at a good pace while others are stalled. When the data is sparse (like in natural language), some parameters may be very rarely updated, and this should be kept in mind when monitoring their evolution.

Finally, many deep learning algorithms provide some sort of guarantee about the results produced at each step. For example, in Part III, we will see some approximate inference algorithms that work by using algebraic solutions to optimization problems. Typically these can be debugged by testing each of their guarantees. Some guarantees that some optimization algorithms offer include that the objective function will never increase after one step of the algorithm, that
the gradient with respect to some subset of variables will be zero after each step of the algorithm, and that the gradient with respect to all variables will be zero at convergence. Usually due to rounding error, these conditions will not hold exactly in a digital computer, so the debugging test should include some tolerance parameter.
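As an illustration of such a test, the sketch below checks the non-increase guarantee of plain gradient descent on a convex quadratic (a stand-in for whatever algorithm is being debugged), with a tolerance for rounding error:

```python
import numpy as np

# Test the "objective never increases" guarantee of a monotone algorithm.
def objective(x):
    return 0.5 * np.dot(x, x)

def step(x, lr=0.1):
    return x - lr * x            # gradient of the objective is x itself

tolerance = 1e-10
x = np.array([3.0, -4.0])
prev = objective(x)
for _ in range(200):
    x = step(x)
    cur = objective(x)
    assert cur <= prev + tolerance, "monotonicity guarantee violated"
    prev = cur

# A second guarantee: the gradient should be (near) zero at convergence.
assert np.linalg.norm(x) < 1e-6
```

The tolerance is essential: bitwise comparisons of floating-point objectives will produce spurious failures even in correct implementations.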
11.6 Example: Multi-Digit Number Recognition
To provide an end-to-end description of how to apply our design methodology in practice, we present a brief account of the Street View transcription system, from the point of view of designing the deep learning components. Obviously, many other components of the complete system, such as the Street View cars, the database infrastructure, and so on, were of paramount importance.

From the point of view of the machine learning task, the process began with data collection. The cars collected the raw data and human operators provided labels. The transcription task was preceded by a significant amount of dataset
CHAPTER 11. PRACTICAL METHODOLOGY
curation, including using other machine learning techniques to detect the house numbers prior to transcribing them.

The transcription project began with a choice of performance metrics and desired values for these metrics. An important general principle is to tailor the choice of metric to the business goals for the project. Because maps are only useful if they have high accuracy, it was important to set a high accuracy requirement for this project. Specifically, the goal was to obtain human-level, 98% accuracy. This level of accuracy may not always be feasible to obtain. In order to reach this level of accuracy, the Street View transcription system sacrifices coverage. Coverage thus became the main performance metric optimized during the project, with accuracy held at 98%. As the convolutional network improved, it became possible to reduce the confidence threshold below which the network refuses to
transcribe the input, eventually exceeding the goal of 95% coverage.

After choosing quantitative goals, the next step in our recommended methodology is to rapidly establish a sensible baseline system. For vision tasks, this means a convolutional network with rectified linear units. The transcription project began with such a model. At the time, it was not common for a convolutional network to output a sequence of predictions. In order to begin with the simplest possible baseline, the first implementation of the output layer of the model consisted of n different softmax units to predict a sequence of n characters. These softmax units were trained exactly the same as if the task were classification, with each softmax unit trained independently.

Our recommended methodology is to iteratively refine the baseline and test whether each change makes an improvement. The first change to the Street View transcription system was motivated by a theoretical understanding of the coverage metric and the structure of the data. Specifically, the network refuses to classify an input x whenever the probability of the output sequence p(y | x) < t for some threshold t. Initially, the definition of p(y | x) was ad-hoc, based on simply multiplying all of the softmax outputs together. This motivated the development of a specialized output layer and cost function that actually computed a principled log-likelihood. This approach allowed the example rejection mechanism to function much more effectively.

At this point, coverage was still below 90%, yet there were no obvious theoretical problems with the approach. Our methodology therefore suggests instrumenting
Our metho dology therefore suggests to instrumen is underfitting or overfitti verfitting. ng. In this case, train and test set error were nearlyt the train and test set p erformance in order to determine the problem iden identical. tical. Indeed, the main reason this pro project ject pro proceeded ceeded whether so smo smoothly othly was the underfitting overfitti ng. tens In this case, train and set error were nearly aisvailability of aor dataset with of millions of lab labeled eledtest examples. Because train identical. Indeed, the main reason this pro ject proceeded so smoothly was the availability of a dataset with tens of millions of labeled examples. Because train 444
and test set error were so similar, this suggested that the problem was either due to underfitting or due to a problem with the training data. One of the debugging strategies we recommend is to visualize the model's worst errors. In this case, that meant visualizing the incorrect training set transcriptions to which the model gave the highest confidence. These proved to mostly consist of examples where the input image had been cropped too tightly, with some of the digits of the address being removed by the cropping operation. For example, a photo of an address "1849" might be cropped too tightly, with only the "849" remaining visible. This problem could have been resolved by spending weeks improving the accuracy of the address number detection system responsible for determining the cropping regions. Instead, the team took a much more practical decision, to simply expand the width of the
crop region to be systematically wider than the address number detection system predicted. This single change added ten percentage points to the transcription system's coverage.

Finally, the last few percentage points of performance came from adjusting hyperparameters. This mostly consisted of making the model larger while maintaining some restrictions on its computational cost. Because train and test error remained roughly equal, it was always clear that any performance deficits were due to underfitting, as well as to a few remaining problems with the dataset itself.

Overall, the transcription project was a great success, and allowed hundreds of millions of addresses to be transcribed both faster and at lower cost than would have been possible via human effort.

We hope that the design principles described in this chapter will lead to many other similar successes.
Chapter 12
Applications

In this chapter, we describe how to use deep learning to solve applications in computer vision, speech recognition, natural language processing, and other application areas of commercial interest. We begin by discussing the large scale neural network implementations required for most serious AI applications. Next, we review several specific application areas that deep learning has been used to solve. While one goal of deep learning is to design algorithms that are capable of solving a broad variety of tasks, so far some degree of specialization is needed. For example, vision tasks require processing a large number of input features (pixels) per example. Language tasks require modeling a large number of possible values (words in the vocabulary) per input feature.
12.1 Large Scale Deep Learning
Deep learning is based on the philosophy of connectionism: while an individual biological neuron or an individual feature in a machine learning model is not intelligent, a large population of these neurons or features acting together can exhibit intelligent behavior. It truly is important to emphasize the fact that the number of neurons must be large. One of the key factors responsible for the improvement in neural networks' accuracy and the improvement of the complexity of tasks they can solve between the 1980s and today is the dramatic increase in the size of the networks we use. As we saw in Sec. 1.2.3, network sizes have grown exponentially for the past three decades, yet artificial neural networks are only as large as the nervous systems of insects.

Because the size of neural networks is of paramount importance, deep learning
CHAPTER 12. APPLICATIONS
requires high performance hardware and software infrastructure.
12.1.1 Fast CPU Implementations
Traditionally, neural networks were trained using the CPU of a single machine. Today, this approach is generally considered insufficient. We now mostly use GPU computing or the CPUs of many machines networked together. Before moving to these expensive setups, researchers worked hard to demonstrate that CPUs could not manage the high computational workload required by neural networks.

A description of how to implement efficient numerical CPU code is beyond the scope of this book, but we emphasize here that careful implementation for specific CPU families can yield large improvements. For example, in 2011, the best CPUs available could run neural network workloads faster when using fixed-point arithmetic rather than floating-point arithmetic. By creating a carefully tuned fixed-point implementation, Vanhoucke et al. (2011) obtained a 3× speedup over
Each new model of CPU has different performance characteristics, so sometimes floating-point implementations can be faster too. The important principle is that careful specialization of numerical computation routines can yield a large payoff. Other strategies, besides choosing whether to use fixed or floating point, include optimizing data structures to avoid cache misses and using vector instructions. Many machine learning researchers neglect these implementation details, but when the performance of an implementation restricts the size of the model, the accuracy of the model suffers.
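To make the fixed-point idea concrete, here is a minimal Python sketch (the Q8 scaling choice and all names are ours, not from the cited work): values are scaled to integers, the multiply-accumulate runs entirely in integer arithmetic, and the result is rescaled once at the end. The speedup itself comes from hardware integer and SIMD units, which this illustration does not capture.

```python
# Illustrative fixed-point dot product (the Q8 format and all names here
# are arbitrary choices for this sketch, not from the cited work).

SCALE = 1 << 8  # Q8 fixed point: 8 fractional bits

def to_fixed(x):
    """Convert a float to a scaled integer."""
    return int(round(x * SCALE))

def fixed_dot(weights, inputs):
    """Dot product carried out entirely in integer arithmetic."""
    w = [to_fixed(v) for v in weights]
    x = [to_fixed(v) for v in inputs]
    acc = sum(wi * xi for wi, xi in zip(w, x))  # integer multiply-accumulate
    return acc / (SCALE * SCALE)                # rescale once at the end

print(fixed_dot([0.5, -1.25, 2.0], [1.0, 0.5, 0.25]))  # 0.375, matching float math
```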
12.1.2
GPU Implementations
Most modern neural network implementations are based on graphics processing units. Graphics processing units (GPUs) are specialized hardware components that were originally developed for graphics applications. The consumer market for video gaming systems spurred development of graphics processing hardware. The performance characteristics needed for good video gaming systems turn out to be beneficial for neural networks as well.

Video game rendering requires performing many operations in parallel quickly. Models of characters and environments are specified in terms of lists of 3-D coordinates of vertices. Graphics cards must perform matrix multiplication and division on many vertices in parallel to convert these 3-D coordinates into 2-D on-screen coordinates.
The graphics card must then perform many computations at each pixel in parallel to determine the color of each pixel. In both cases, the
CHAPTER 12. APPLICATIONS
computations are fairly simple and do not involve much branching compared to the computational workload that a CPU usually encounters. For example, each vertex in the same rigid object will be multiplied by the same matrix; there is no need to evaluate an if statement per vertex to determine which matrix to multiply by. The computations are also entirely independent of each other, and thus may be parallelized easily. The computations also involve processing massive buffers of memory, containing bitmaps describing the texture (color pattern) of each object to be rendered. Together, this results in graphics cards having been designed to have a high degree of parallelism and high memory bandwidth, at the cost of having a lower clock speed and less branching capability relative to traditional CPUs.

Neural network algorithms require the same performance characteristics as the real-time graphics algorithms described above.
Neural networks usually involve large and numerous buffers of parameters, activation values, and gradient values, each of which must be completely updated during every step of training. These buffers are large enough to fall outside the cache of a traditional desktop computer, so the memory bandwidth of the system often becomes the rate limiting factor. GPUs offer a compelling advantage over CPUs due to their high memory bandwidth. Neural network training algorithms typically do not involve much branching or sophisticated control, so they are appropriate for GPU hardware. Since neural networks can be divided into multiple individual "neurons" that can be processed
independently from the other neurons in the same layer, neural networks easily benefit from the parallelism of GPU computing.

GPU hardware was originally so specialized that it could only be used for graphics tasks. Over time, GPU hardware became more flexible, allowing custom subroutines to be used to transform the coordinates of vertices or assign colors to pixels. In principle, there was no requirement that these pixel values actually be based on a rendering task. These GPUs could be used for scientific computing by writing the output of a computation to a buffer of pixel values. Steinkraus et al. (2005) implemented a two-layer fully connected neural network on a GPU and reported a 3× speedup over their CPU-based baseline. Shortly thereafter, Chellapilla et al. (2006) demonstrated that the same technique could be used to accelerate supervised convolutional networks.

The popularity of graphics cards for neural network training exploded after the advent of general purpose GPUs.
These GP-GPUs could execute arbitrary code, not just rendering subroutines. NVIDIA's CUDA programming language provided a way to write this arbitrary code in a C-like language. With their relatively convenient programming model, massive parallelism, and high memory bandwidth,
GP-GPUs now offer an ideal platform for neural network programming. This platform was rapidly adopted by deep learning researchers soon after it became available (Raina et al., 2009; Ciresan et al., 2010).

Writing efficient code for GP-GPUs remains a difficult task best left to specialists. The techniques required to obtain good performance on GPU are very different from those used on CPU. For example, good CPU-based code is usually designed to read information from the cache as much as possible. On GPU, most writable memory locations are not cached, so it can actually be faster to compute the same value twice, rather than compute it once and read it back from memory. GPU code is also inherently multi-threaded and the different threads must be coordinated with each other carefully.
For example, memory operations are faster if they can be coalesced. Coalesced reads or writes occur when several threads can each read or write a value that they need simultaneously, as part of a single memory transaction. Different models of GPUs are able to coalesce different kinds of read or write patterns. Typically, memory operations are easier to coalesce if among n threads, thread i accesses byte i + j of memory, and j is a multiple of some power of 2. The exact specifications differ between models of GPU. Another common consideration for GPUs is making sure that each thread in a group executes the same instruction simultaneously. This means that branching can be difficult on GPU. Threads are divided into small groups called warps. Each thread in a warp executes the same instruction during each cycle, so if different threads within the same warp need to execute different code paths, these different code paths must be traversed sequentially rather than in parallel.
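A toy model of coalescing may help. In the sketch below, a warp of 32 threads each issues one byte address, and we count how many aligned segments those addresses touch; the 128-byte transaction size is an assumed value for illustration, since the real rules vary by GPU generation.

```python
# Toy model of coalescing: count how many aligned memory segments one
# warp's addresses touch. SEGMENT = 128 bytes is an assumed transaction
# size for illustration; real coalescing rules differ across GPUs.

WARP_SIZE = 32
SEGMENT = 128

def transactions(addresses):
    """Number of aligned segments (memory transactions) a warp touches."""
    return len({addr // SEGMENT for addr in addresses})

# Thread i accesses byte i + j with j a multiple of a power of 2: coalesced.
aligned = [i + 128 for i in range(WARP_SIZE)]
print(transactions(aligned))   # 1: the whole warp shares one transaction

# A large stride scatters the accesses across many segments.
strided = [i * 256 for i in range(WARP_SIZE)]
print(transactions(strided))   # 32: one transaction per thread
```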
Due to the difficulty of writing high performance GPU code, researchers should structure their workflow to avoid needing to write new GPU code in order to test new models or algorithms. Typically, one can do this by building a software library of high performance operations like convolution and matrix multiplication, then specifying models in terms of calls to this library of operations. For example, the machine learning library Pylearn2 (Goodfellow et al., 2013c) specifies all of its machine learning algorithms in terms of calls to Theano (Bergstra et al., 2010; Bastien et al., 2012) and cuda-convnet (Krizhevsky, 2010), which provide these high-performance operations. This factored approach can also ease support for multiple kinds of hardware. For example, the same Theano program can run on either CPU or GPU, without needing to change any of the calls to Theano itself.
Other libraries like TensorFlow (Abadi et al., 2015) and Torch (Collobert et al., 2011b) provide similar features.
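The factored approach can be sketched in miniature: the model code below is written purely against a small library of operations, so a different backend (say, a GPU one) could be substituted without touching the model. All class and function names here are hypothetical.

```python
# The model is expressed purely as calls to a small operation library,
# so the backend could be swapped (e.g., for a GPU implementation)
# without changing the model code. All names here are hypothetical.

class CPUBackend:
    def matmul(self, a, b):
        rows, inner, cols = len(a), len(b), len(b[0])
        return [[sum(a[i][k] * b[k][j] for k in range(inner))
                 for j in range(cols)] for i in range(rows)]

    def relu(self, m):
        return [[max(0.0, v) for v in row] for row in m]

def forward(backend, x, w):
    """Model code: nothing here but calls into the backend library."""
    return backend.relu(backend.matmul(x, w))

x = [[1.0, -2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
print(forward(CPUBackend(), x, w))  # [[1.0, 0.0]]
```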
12.1.3
Large Scale Distributed Implementations
In many cases, the computational resources available on a single machine are insufficient. We therefore want to distribute the workload of training and inference across many machines.

Distributing inference is simple, because each input example we want to process can be run by a separate machine. This is known as data parallelism.

It is also possible to get model parallelism, where multiple machines work together on a single datapoint, with each machine running a different part of the model. This is feasible for both inference and training.

Data parallelism during training is somewhat harder. We can increase the size of the minibatch used for a single SGD step, but usually we get less than linear returns in terms of optimization performance. It would be better to allow multiple machines to compute multiple gradient descent steps in parallel. Unfortunately,
ould be better to allowalgorithm: multiple the standard definition of gradien gradient descen descentt is asItawcompletely sequential macgradient hines to at compute multiple gradient stepspro in duced parallel. , the step t is a function of thedescent parameters produced by Unfortunately step t − 1. the standard definition of gradient descent is as a completely sequential algorithm: can at bestep solv solved asynchr asynchronous onous sto stochastic chasticpro gr gradient adientby desc descent entt (Bengio the This gradient tedis using a function of the parameters duced step 1. et al. al.,, 2001; Rec Recht ht et al. al.,, 2011). In this approach, several pro processor cessor cores share − can representing be solved using asynchronous stochastic gradient descentwithout (Bengio the This memory the parameters. Each core reads parameters a et al. , 2001 ; Rec ht et al. , 2011 ). In this approach, several pro cessor cores share lo locck, then computes a gradient, then increments the parameters without a lo lock. ck. the memory representing the parameters. Each core reads parameters without a This reduces the av average erage amount of improv improvement ement that each gradien gradientt descen descentt step lock, then computes a gradient, then increments the parameters without a lock. yields, because some of the cores ov overwrite erwrite eac each h other’s progress, but the increased This of reduces the avof erage ement that each step rate pro production duction stepsamount causes of theimprov learning pro process cess to be gradien faster otvdescen erall. tDean yields, because some ofthe themulti-mac cores overwrite each other’s progress, but the approach increased et al. (2012 ) pioneered multi-machine hine implementation of this lo lock-free ck-free rate of pro duction of steps causes the learning pro cess to b e faster o v erall. Dean to gradient descent, where the parameters are managed by a par arameter ameter server et al. 
(2012 pioneered the multi-mac hine implementation of this logradient ck-free approach rather than) stored in shared memory memory. . Distributed async asynchronous hronous descent to gradient descent, where the parameters are managed by a p ar ameter server remains the primary strategy for training large deep net netw works and is used by rather than stored in shared memory . Distributed async hronous gradient descent most ma major jor deep learning groups in industry (Chilimbi et al., 2014; Wu et al. al.,, remains the primary strategy for training large deep net w orks and is used by 2015 2015). ). Academic deep learning researchers typically cannot afford the same scale most major deep learning groups industry (Chilimbi et al., on 2014 ; wWto u et al., of distributed learning systems butinsome research has fo focused cused ho how build 2015). Academic deepwith learning researchers typically cannot afford scale distributed netw networks orks relatively low-cost hardw hardware are av available ailable in the the same universit university y of distributed learning systems but some research has fo cused on ho w to build setting (Coates et al., 2013). distributed networks with relatively low-cost hardware available in the university setting (Coates et al., 2013).
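A minimal single-machine sketch of the lock-free scheme, with Python threads standing in for processor cores and a one-dimensional quadratic standing in for a real loss: each worker reads the shared parameter, computes a gradient, and writes back, all without locking, so some updates clobber others, yet the iterate still converges.

```python
# Lock-free (Hogwild!-style) asynchronous SGD on a toy objective
# (w - 3)^2; threads stand in for processor cores. No locks are taken,
# so workers occasionally overwrite each other's updates.
import threading

params = [0.0]   # shared parameter, read and written without a lock
LR = 0.01

def worker(steps):
    for _ in range(steps):
        w = params[0]              # lock-free read
        grad = 2.0 * (w - 3.0)     # gradient of (w - 3)^2
        params[0] = w - LR * grad  # lock-free write; may clobber a peer

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(params[0])  # approaches the optimum at w = 3
```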
12.1.4
Model Compression
12.1.4 Model Compression In many commercial applications, it is muc much h more imp importan ortan ortantt that the time and memory cost of running inference in a machine learning mo model del be low than that In many commercial applications, it is muc h more imp ortan thatdo thenot time and the time and memory cost of training be low. For applicationstthat require memory cost of running inference in a machine learning model be low than that the time and memory cost of training b450 e low. For applications that do not require
personalization, it is possible to train a model once, then deploy it to be used by billions of users. In many cases, the end user is more resource-constrained than the developer. For example, one might train a speech recognition network with a powerful computer cluster, then deploy it on mobile phones.

A key strategy for reducing the cost of inference is model compression (Buciluǎ et al., 2006). The basic idea of model compression is to replace the original, expensive model with a smaller model that requires less memory and runtime to store and evaluate.

Model compression is applicable when the size of the original model is driven primarily by a need to prevent overfitting. In most cases, the model with the lowest generalization error is an ensemble of several independently trained models. Evaluating all n ensemble members is expensive. Sometimes, even a single model
generalizes better if it is large (for example, if it is regularized with dropout).

These large models learn some function f(x), but do so using many more parameters than are necessary for the task. Their size is necessary only due to the limited number of training examples. As soon as we have fit this function f(x), we can generate a training set containing infinitely many examples, simply by applying f to randomly sampled points x. We then train the new, smaller model to match f(x) on these points. In order to most efficiently use the capacity of the new, small model, it is best to sample the new x points from a distribution resembling the actual test inputs that will be supplied to the model later. This can be done by corrupting training examples or by drawing points from a generative model trained on the original training set.

Alternatively, one can train the smaller model only on the original training
points, but train it to copy other features of the model, such as its posterior distribution over the incorrect classes (Hinton et al., 2014, 2015).
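The synthetic-data recipe above can be sketched with toy stand-ins: an arbitrary fixed function plays the role of the large teacher model, and a one-parameter linear student is fit by SGD to labels the teacher produces on freshly sampled inputs.

```python
# Model compression with synthetic data: an arbitrary fixed function
# stands in for the expensive teacher; the student is a one-parameter
# linear model trained by SGD on teacher-labeled samples.
import random

def teacher(x):
    """Placeholder for a large, expensive model."""
    return 2.0 * x

# Unlimited training data: sample inputs, label them with the teacher.
data = [(x, teacher(x)) for x in (random.uniform(-1.0, 1.0) for _ in range(1000))]

w = 0.0  # student parameter
for x, y in data:
    w -= 0.1 * 2.0 * (w * x - y) * x   # SGD on squared error

print(w)  # the student recovers roughly 2.0, matching the teacher
```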
12.1.5
Dynamic Structure
One strategy for accelerating data processing systems in general is to build systems that have dynamic structure in the graph describing the computation needed to process an input. Data processing systems can dynamically determine which subset of many neural networks should be run on a given input. Individual neural networks can also exhibit dynamic structure internally by determining which subset of features (hidden units) to compute given information from the input. This form of dynamic structure inside neural networks is sometimes called conditional computation (Bengio, 2013; Bengio et al., 2013b). Since many components of the architecture may be relevant only for a small amount of possible inputs, the system
can run faster by computing these features only when they are needed.

Dynamic structure of computations is a basic computer science principle applied generally throughout the software engineering discipline. The simplest versions of dynamic structure applied to neural networks are based on determining which subset of some group of neural networks (or other machine learning models) should be applied to a particular input.

A venerable strategy for accelerating inference in a classifier is to use a cascade of classifiers. The cascade strategy may be applied when the goal is to detect the presence of a rare object (or event). To know for sure that the object is present, we must use a sophisticated classifier with high capacity, that is expensive to run. However, because the object is rare, we can usually use much less computation to reject inputs as not containing the object. In these situations, we can train a sequence of classifiers.
The first classifiers in the sequence have low capacity, and are trained to have high recall. In other words, they are trained to make sure we do not wrongly reject an input when the object is present. The final classifier is trained to have high precision. At test time, we run inference by running the classifiers in a sequence, abandoning any example as soon as any one element in the cascade rejects it. Overall, this allows us to verify the presence of objects with high confidence, using a high capacity model, but does not force us to pay the cost of full inference for every example. There are two different ways that the cascade can achieve high capacity. One way is to make the later members of the cascade individually have high capacity.
In this case, the system as a whole obviously has high capacity, because some of its individual members do. It is also possible to make a cascade in which every individual model has low capacity but the system as a whole has high capacity due to the combination of many small models. Viola and Jones (2001) used a cascade of boosted decision trees to implement a fast and robust face detector suitable for use in handheld digital cameras. Their classifier localizes a face using essentially a sliding window approach in which many windows are examined and rejected if they do not contain faces. Another version of cascades uses the earlier models to implement a sort of hard attention mechanism: the early members of the cascade localize an object and later members of the cascade perform further processing given the location of the object.
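The test-time cascade loop can be sketched as follows; the two stage functions are hypothetical placeholders for a cheap, high-recall filter and an expensive, high-precision classifier.

```python
# A two-stage cascade: a cheap, high-recall filter rejects most inputs,
# and only survivors pay for the expensive, high-precision classifier.
# Both stage functions are hypothetical placeholders.

def cheap_stage(x):
    """Fast filter: keeps anything that might be positive."""
    return x > 0

def expensive_stage(x):
    """Slow, high-precision classifier, run only on survivors."""
    return x > 10

def cascade(x, stages):
    for stage in stages:
        if not stage(x):
            return False   # abandon the example at the first rejection
    return True

inputs = [-5, 3, 12, 40]
print([cascade(x, [cheap_stage, expensive_stage]) for x in inputs])
# [False, False, True, True]
```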
For example, Google transcribes address numbers from Street View imagery using a two-step cascade that first locates the address number with one machine learning model and then transcribes it with another (Goodfellow et al., 2014d).

Decision trees themselves are an example of dynamic structure, because each node in the tree determines which of its subtrees should be evaluated for each input. A simple way to accomplish the union of deep learning and dynamic structure
CHAPTER 12. APPLICATIONS
is to train a decision tree in which each node uses a neural network to make the splitting decision (Guo and Gelfand, 1992), though this has typically not been done with the primary goal of accelerating inference computations.

In the same spirit, one can use a neural network, called the gater, to select which one out of several expert networks will be used to compute the output, given the current input. The first version of this idea is called the mixture of experts (Nowlan, 1990; Jacobs et al., 1991), in which the gater outputs a set of probabilities or weights (obtained via a softmax nonlinearity), one per expert, and the final output is obtained by the weighted combination of the output of the experts.
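A minimal sketch of this gating scheme is below. The `gater` and `experts` callables are hypothetical stand-ins for trained networks, not the architecture of any cited paper; the soft variant computes the weighted combination, while the hard variant evaluates only the selected expert.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_of_experts(x, gater, experts, hard=False):
    """Soft vs. hard mixture of experts for a single input x.

    `gater` maps x to one logit per expert; each entry of `experts`
    maps x to an output."""
    weights = softmax(gater(x))                  # one probability per expert
    if hard:
        # hard mixture: run only the chosen expert (saves computation)
        return experts[int(np.argmax(weights))](x)
    # soft mixture: weighted combination of all expert outputs
    outputs = np.stack([expert(x) for expert in experts])
    return weights @ outputs
```

The soft mixture must evaluate every expert, which is why, as the text notes next, it offers no reduction in computational cost.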
In that single exp expert ert is chosen by the gater for eac each h example, we obtain the har hard d mixtur mixturee case, the use of the gater do es not offer a reduction in computational cost, but if a of exp experts erts (Collob Collobert ert et al., 2001, 2002), which can considerably accelerate training single exp ert is chosen by the gater for eac h example, we obtain the har d mixtur and inference time. This strategy works well when the num numb ber of gating decisions ise of experts (Collob al.,binatorial. 2001, 2002), which accelerate small because it isert notetcom combinatorial. But whencan weconsiderably wan wantt to select differen differentttraining subsets and inference time. This strategy w orks w ell when the num b er of gating decisions is of units or parameters, it is not possible to use a “soft switch” because it requires small because (and it is not combinatorial. we wan t to select differentTsubsets en enumerating umerating computing outputsBut for)when all the gater configurations. o deal of units or parameters, it is not p ossible to use a “soft switch” b ecause it requires with this problem, sev several eral approaches hav havee been explored to train combinatorial enumerating (and computing outputs for)with all the gater configurations. To deal gaters. Bengio et al. (2013b) exp experiment eriment several estimators of the gradient with problem, several approaches e b(een explored to train combinatorial on thethis gating probabilities, while Baconhav et al. 2015 ) and Bengio et al. (2015a) use gaters. Bengio et al. ( 2013b ) exp eriment with several estimators of the gradient reinforcemen reinforcementt learning techniques (p (policy olicy gradient) to learn a form of conditional on the gating probabilities, while Bacon et al. ( 2015 ) and Bengio et al. ( 2015a ) use drop dropout out on blo bloccks of hidden units and get an actual reduction in computational reinforcemen learning techniques gradient) a ximation. 
Another kind of dynamic structure is a switch, where a hidden unit can receive input from different units depending on the context. This dynamic routing approach can be interpreted as an attention mechanism (Olshausen et al., 1993). So far, the use of a hard switch has not proven effective on large-scale applications. Contemporary approaches instead use a weighted average over many possible inputs, and thus do not achieve all of the possible computational benefits of dynamic structure. Contemporary attention mechanisms are described in Sec. 12.4.5.1.

One major obstacle to using dynamically structured systems is the decreased degree of parallelism that results from the system following different code branches for different inputs.
This means that few operations in the network can be described as matrix multiplication or batch convolution on a minibatch of examples. We can write more specialized sub-routines that convolve each example with different kernels or multiply each row of a design matrix by a different set of columns of weights. Unfortunately, these more specialized subroutines are difficult to implement efficiently. CPU implementations will be slow due to the lack of cache coherence and GPU implementations will be slow due to the lack of coalesced
memory transactions and the need to serialize warps when members of a warp take different branches. In some cases, these issues can be mitigated by partitioning the examples into groups that all take the same branch, and processing these groups of examples simultaneously. This can be an acceptable strategy for minimizing the time required to process a fixed amount of examples in an offline setting. In a real-time setting where examples must be processed continuously, partitioning the workload can result in load-balancing issues. For example, if we assign one machine to process the first step in a cascade and another machine to process the last step in a cascade, then the first will tend to be overloaded and the last will tend to be underloaded. Similar issues arise if each machine is assigned to implement different nodes of a neural decision tree.
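The grouping strategy just described, partitioning a minibatch by branch so that each group can be processed with one batched operation, can be sketched as follows. The `router` and branch functions are hypothetical stand-ins for whatever dynamically structured model is in use.

```python
import numpy as np

def batched_dynamic_forward(x_batch, router, branches):
    """Mitigate the parallelism loss of dynamic structure by grouping
    examples that take the same branch and running each group as one
    batched operation.

    `router` assigns a branch index to every example; `branches` is a
    list of batch-capable functions."""
    branch_ids = router(x_batch)                 # one branch index per example
    out = np.empty(len(x_batch), dtype=float)
    for b, fn in enumerate(branches):
        idx = np.nonzero(branch_ids == b)[0]
        if idx.size:                             # one big vectorized op per group
            out[idx] = fn(x_batch[idx])
    return out
```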
12.1.6 Specialized Hardware Implementations of Deep Networks
Since the early days of neural networks research, hardware designers have worked on specialized hardware implementations that could speed up training and/or inference of neural network algorithms. See early and more recent reviews of specialized hardware for deep networks (Lindsey and Lindblad, 1994; Beiu et al., 2003; Misra and Saha, 2010).

Different forms of specialized hardware (Graf and Jackel, 1989; Mead and Ismail, 2012; Kim et al., 2009; Pham et al., 2012; Chen et al., 2014a,b) have been developed over the last decades, with ASICs (application-specific integrated circuits) that are either digital (based on binary representations of numbers), analog (Graf and Jackel, 1989; Mead and Ismail, 2012) (based on physical implementations of continuous values as voltages or currents), or hybrid implementations (combining digital and analog components). In recent years more flexible FPGA (field programmable gate array) implementations (where the particulars of the circuit can be written on the chip after it has been built) have been developed.

Though software implementations on general-purpose processing units (CPUs and GPUs) typically use 32 or 64 bits of precision to represent floating point numbers, it has long been known that it is possible to use less precision, at least at inference time (Holt and Baker, 1991; Holi and Hwang, 1993; Presley and Haggard, 1994; Simard and Graf, 1994; Wawrzynek et al., 1996; Savich et al., 2007). This has become a more pressing issue in recent years as deep learning has gained in popularity in industrial products, and as the great impact of faster hardware was demonstrated with GPUs. Another factor that motivates current research on specialized hardware for deep networks is that the rate of progress of a single CPU or GPU core has slowed down, and most recent improvements in computing speed have come from parallelization across cores (either in CPUs or
GPUs). This is very different from the situation of the 1990s (the previous neural network era), when the hardware implementations of neural networks (which might take two years from inception to availability of a chip) could not keep up with the rapid progress and low prices of general-purpose CPUs. Building specialized hardware is thus a way to push the envelope further, at a time when new hardware designs are being developed for low-power devices such as phones, aiming for general-public applications of deep learning (e.g., with speech, computer vision or natural language).

Recent work on low-precision implementations of backprop-based neural nets (Vanhoucke et al., 2011; Courbariaux et al., 2015; Gupta et al., 2015) suggests that between 8 and 16 bits of precision can suffice for using or training deep neural networks with back-propagation. What is clear is that more precision is required during training than at inference time, and that some forms of dynamic fixed point representation of numbers can be used to reduce how many bits are required per number. Traditional fixed point numbers are restricted to a fixed range (which corresponds to a given exponent in a floating point representation). Dynamic fixed point representations share that range among a set of numbers (such as all the weights in one layer). Using fixed point rather than floating point representations and using fewer bits per number reduces the hardware surface area, power requirements and computing time needed for performing multiplications, and multiplications are the most demanding of the operations needed to use or train a modern deep network with backprop.
12.2 Computer Vision
Computer vision has traditionally been one of the most active research areas for deep learning applications, because vision is a task that is effortless for humans and many animals but challenging for computers (Ballard et al., 1983). Many of the most popular standard benchmark tasks for deep learning algorithms are forms of object recognition or optical character recognition.

Computer vision is a very broad field encompassing a wide variety of ways of processing images, and an amazing diversity of applications. Applications of computer vision range from reproducing human visual abilities, such as recognizing faces, to creating entirely new categories of visual abilities. As an example of the latter category, one recent computer vision application is to recognize sound waves from the vibrations they induce in objects visible in a video (Davis et al., 2014). Most deep learning research on computer vision has not focused on such exotic applications that expand the realm of what is possible with imagery but
rather on a small core of AI goals aimed at replicating human abilities. Most deep learning for computer vision is used for object recognition or detection of some form, whether this means reporting which object is present in an image, annotating an image with bounding boxes around each object, transcribing a sequence of symbols from an image, or labeling each pixel in an image with the identity of the object it belongs to. Because generative modeling has been a guiding principle of deep learning research, there is also a large body of work on image synthesis using deep models. While image synthesis ex nihilo is usually not considered a computer vision endeavor, models capable of image synthesis are usually useful for image restoration, a computer vision task involving repairing defects in images or removing objects from images.
12.2.1 Preprocessing
Many application areas require sophisticated preprocessing because the original input comes in a form that is difficult for many deep learning architectures to represent. Computer vision usually requires relatively little of this kind of preprocessing. The images should be standardized so that their pixels all lie in the same, reasonable range, like [0,1] or [-1, 1]. Mixing images that lie in [0,1] with images that lie in [0, 255] will usually result in failure. Formatting images to have the same scale is the only kind of preprocessing that is strictly necessary. Many computer vision architectures require images of a standard size, so images must be cropped or scaled to fit that size. However, even this rescaling is not always strictly necessary. Some convolutional models accept variably-sized inputs and dynamically adjust the size of their pooling regions to keep the output size constant (Waibel et al., 1989).
Other convolutional models have variable-sized output that automatically scales in size with the input, such as models that denoise or label each pixel in an image (Hadsell et al., 2007).

Dataset augmentation may be seen as a way of preprocessing the training set only. Dataset augmentation is an excellent way to reduce the generalization error of most computer vision models. A related idea applicable at test time is to show the model many different versions of the same input (for example, the same image cropped at slightly different locations) and have the different instantiations of the model vote to determine the output. This latter idea can be interpreted as an ensemble approach, and helps to reduce generalization error.
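The test-time voting idea can be sketched as follows, with a hypothetical `model` callable that maps one crop to a vector of class probabilities; the averaging over crops is the ensemble step.

```python
import numpy as np

def multi_crop_predict(model, image, crop_size, n_classes, offsets):
    """Test-time ensemble over crops of one image: run the model on
    several shifted crops and average the predicted class probabilities.

    `offsets` is a list of (row, col) top-left corners for the crops."""
    probs = np.zeros(n_classes)
    for (r, c) in offsets:
        crop = image[r:r + crop_size, c:c + crop_size]
        probs += model(crop)              # accumulate each instantiation's vote
    probs /= len(offsets)                 # average of the per-crop predictions
    return probs.argmax(), probs
```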
Other kinds of preprocessing are applied to both the train and the test set with the goal of putting each example into a more canonical form in order to reduce the amount of variation that the model needs to account for. Reducing the amount of variation in the data can both reduce generalization error and reduce the size of
the model needed to fit the training set. Simpler tasks may be solved by smaller models, and simpler solutions are more likely to generalize well. Preprocessing of this kind is usually designed to remove some kind of variability in the input data that is easy for a human designer to describe and that the human designer is confident has no relevance to the task. When training with large datasets and large models, this kind of preprocessing is often unnecessary, and it is best to just let the model learn which kinds of variability it should become invariant to. For example, the AlexNet system for classifying ImageNet only has one preprocessing step: subtracting the mean across training examples of each pixel (Krizhevsky et al., 2012).

12.2.1.1 Contrast Normalization

One of the most obvious sources of variation that can be safely removed for many tasks is the amount of contrast in the image. Contrast simply refers to the magnitude of the difference between the bright and the dark pixels in an image. There are many ways of quantifying the contrast of an image. In the context of deep learning, contrast usually refers to the standard deviation of the pixels in an image or region of an image. Suppose we have an image represented by a tensor X \in \mathbb{R}^{r \times c \times 3}, with X_{i,j,1} being the red intensity at row i and column j, X_{i,j,2} giving the green intensity and X_{i,j,3} giving the blue intensity. Then the contrast of the entire image is given by

    \sqrt{ \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} \left( X_{i,j,k} - \bar{X} \right)^2 }    (12.1)

where \bar{X} is the mean intensity of the entire image:

    \bar{X} = \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} X_{i,j,k}.    (12.2)

Global contrast normalization (GCN) aims to prevent images from having varying amounts of contrast by subtracting the mean from each image, then rescaling it so that the standard deviation across its pixels is equal to some constant s. This approach is complicated by the fact that no scaling factor can change the contrast of a zero-contrast image (one whose pixels all have equal intensity). Images with very low but non-zero contrast often have little information content. Dividing by the true standard deviation usually accomplishes nothing more than amplifying sensor noise or compression artifacts in such cases. This
motivates introducing a small, positive regularization parameter \lambda to bias the estimate of the standard deviation. Alternately, one can constrain the denominator to be at least \epsilon. Given an input image X, GCN produces an output image X', defined such that

    X'_{i,j,k} = s \, \frac{ X_{i,j,k} - \bar{X} }{ \max\left\{ \epsilon, \; \sqrt{ \lambda + \frac{1}{3rc} \sum_{i=1}^{r} \sum_{j=1}^{c} \sum_{k=1}^{3} \left( X_{i,j,k} - \bar{X} \right)^2 } \right\} }.    (12.3)

Datasets consisting of large images cropped to interesting objects are unlikely to contain any images with nearly constant intensity. In these cases, it is safe to practically ignore the small denominator problem by setting \lambda = 0 and avoid division by 0 in extremely rare cases by setting \epsilon to an extremely low value like 10^{-8}. This is the approach used by Goodfellow et al. (2013a) on the CIFAR-10 dataset. Small images cropped randomly are more likely to have nearly constant intensity, making aggressive regularization more useful.
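Eq. 12.3 translates directly into code. This sketch assumes the image is stored as a floating-point NumPy array; the function name and default parameter values are illustrative.

```python
import numpy as np

def global_contrast_normalize(X, s=1.0, lam=0.0, eps=1e-8):
    """Global contrast normalization of one image, following Eq. 12.3:
    subtract the image's mean intensity, then rescale so the standard
    deviation across all pixels (and channels) equals s.

    lam regularizes the standard deviation estimate and eps bounds the
    denominator away from zero."""
    X = X - X.mean()                              # remove mean intensity
    contrast = np.sqrt(lam + (X ** 2).mean())     # sqrt(lam + (1/3rc) * sum of squares)
    return s * X / max(contrast, eps)
```

With lam = 0 the output has zero mean and standard deviation exactly s (whenever the input is not constant).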
The scale parameter s can usually be set to 1, as done by Coates et al. (2011), or chosen to make each individual pixel have standard deviation across examples close to 1, as done by Goodfellow et al. (2013a).

The standard deviation in Eq. 12.3 is just a rescaling of the L^2 norm of the image (assuming the mean of the image has already been removed). It is preferable to define GCN in terms of standard deviation rather than L^2 norm because the standard deviation includes division by the number of pixels, so GCN based on standard deviation allows the same s to be used regardless of image size. However, the observation that the L^2 norm is proportional to the standard deviation can help build a useful intuition. One can understand GCN as mapping examples to a spherical shell. See Fig. 12.1 for an illustration. This can be a useful property because neural networks are often better at responding to directions in space rather than exact locations.
Responding to multiple distances in the same direction requires hidden units with collinear weight vectors but different biases. Such coordination can be difficult for the learning algorithm to discover. Additionally, many shallow graphical models have problems with representing multiple separated modes along the same line. GCN avoids these problems by reducing each example to a direction rather than a direction and a distance.

Counterintuitively, there is a preprocessing operation known as sphering and it is not the same operation as GCN. Sphering does not refer to making the data lie on a spherical shell, but rather to rescaling the principal components to have equal variance, so that the multivariate normal distribution used by PCA has spherical contours. Sphering is more commonly known as whitening.
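To make the contrast with GCN concrete, here is a minimal sketch of sphering (PCA whitening) in NumPy. The function name and the small eigenvalue regularizer are illustrative assumptions:

```python
import numpy as np

def sphere(X, epsilon=1e-8):
    """Whiten a design matrix X (examples in rows): rescale the principal
    components to unit variance so the covariance becomes the identity."""
    X = X - X.mean(axis=0)                    # center each feature
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)    # principal directions and variances
    return (X @ eigvecs) / np.sqrt(eigvals + epsilon)
```

Unlike GCN, which rescales each example by its own contrast, sphering applies one dataset-wide linear transform estimated from all examples.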
[Figure 12.1: three scatter plots over axes x0 (−1.5 to 1.5) and x1 (−1.5 to 1.5), with panels titled "Raw input", "GCN, λ = 0", and "GCN, λ = 10^{-2}".]
Figure 12.1: GCN maps examples onto a sphere. (Left) Raw input data may have any norm. (Center) GCN with λ = 0 maps all non-zero examples perfectly onto a sphere. Here we use s = 1 and ε = 10^{-8}. Because we use GCN based on normalizing the standard deviation rather than the L^2 norm, the resulting sphere is not the unit sphere. (Right) Regularized GCN, with λ > 0, draws examples toward the sphere but does not completely discard the variation in their norm. We leave s and ε the same as before.
Global contrast normalization will often fail to highlight image features we would like to stand out, such as edges and corners. If we have a scene with a large dark area and a large bright area (such as a city square with half the image in the shadow of a building) then global contrast normalization will ensure there is a large difference between the brightness of the dark area and the brightness of the light area. It will not, however, ensure that edges within the dark region stand out.

This motivates local contrast normalization. Local contrast normalization ensures that the contrast is normalized across each small window, rather than over the image as a whole. See Fig. 12.2 for a comparison of global and local contrast normalization.

Various definitions of local contrast normalization are possible. In all cases, one modifies each pixel by subtracting a mean of nearby pixels and dividing by a standard deviation of nearby pixels. In some cases, this is literally the mean and standard deviation of all pixels in a rectangular window centered on the pixel to be modified (Pinto et al., 2008). In other cases, this is a weighted mean and weighted standard deviation using Gaussian weights centered on the pixel to be modified. In the case of color images, some strategies process different color channels separately while others combine information from different channels to normalize each pixel (Sermanet et al., 2012).
[Figure 12.2: rows of example images shown in three columns titled "Input image", "GCN", and "LCN".]
Figure 12.2: A comparison of global and local contrast normalization. Visually, the effects of global contrast normalization are subtle. It places all images on roughly the same scale, which reduces the burden on the learning algorithm to handle multiple scales. Local contrast normalization modifies the image much more, discarding all regions of constant intensity. This allows the model to focus on just the edges. Regions of fine texture, such as the houses in the second row, may lose some detail due to the bandwidth of the normalization kernel being too high.
Local contrast normalization can usually be implemented efficiently by using separable convolution (see Sec. 9.8) to compute feature maps of local means and local standard deviations, then using element-wise subtraction and element-wise division on different feature maps.

Local contrast normalization is a differentiable operation and can also be used as a nonlinearity applied to the hidden layers of a network, as well as a preprocessing operation applied to the input.

As with global contrast normalization, we typically need to regularize local contrast normalization to avoid division by zero. In fact, because local contrast normalization typically acts on smaller windows, it is even more important to regularize. Smaller windows are more likely to contain values that are all nearly the same as each other, and thus more likely to have zero standard deviation.
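A sketch of this recipe in NumPy: the hypothetical implementation below uses a separable box filter (two 1-D convolutions) for the local means and standard deviations, rather than the Gaussian weights some definitions use, and the window size and ε are illustrative choices:

```python
import numpy as np

def lcn(image, size=9, epsilon=1e-4):
    """Local contrast normalization of a 2-D grayscale image."""
    image = np.asarray(image, dtype=np.float64)
    kernel = np.ones(size) / size

    def box_blur(x):
        # separable convolution: filter along rows, then along columns
        x = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), 1, x)
        return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), 0, x)

    weight = box_blur(np.ones_like(image))         # corrects for border effects
    centered = image - box_blur(image) / weight    # subtract local mean
    local_std = np.sqrt(box_blur(centered ** 2) / weight)
    return centered / np.maximum(local_std, epsilon)
```

Note the `np.maximum(local_std, epsilon)` in the denominator: this is the regularization discussed above, which matters most when a window is nearly constant.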
12.2.1.2 Dataset Augmentation

As described in Sec. 7.4, it is easy to improve the generalization of a classifier by increasing the size of the training set by adding extra copies of the training examples that have been modified with transformations that do not change the
class. Object recognition is a classification task that is especially amenable to this form of dataset augmentation because the class is invariant to so many transformations and the input can be easily transformed with many geometric operations. As described before, classifiers can benefit from random translations, rotations, and in some cases, flips of the input to augment the dataset. In specialized computer vision applications, more advanced transformations are commonly used for dataset augmentation. These schemes include random perturbation of the colors in an image (Krizhevsky et al., 2012) and nonlinear geometric distortions of the input (LeCun et al., 1998b).
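A label-preserving augmentation step for images might look like the following sketch. The flip probability and shift range are arbitrary illustrative choices, and np.roll's wrap-around is used only for brevity (a real implementation would pad rather than wrap):

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and translated copy of a 2-D image array."""
    if rng.rand() < 0.5:
        image = image[:, ::-1]           # random horizontal flip
    dy, dx = rng.randint(-2, 3, size=2)  # translate by up to 2 pixels each way
    return np.roll(image, (dy, dx), axis=(0, 1))
```

Applying a fresh random transform each time an example is visited effectively multiplies the size of the training set without storing extra copies.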
12.3
Speech Recognition
The task of speech recognition is to map an acoustic signal containing a spoken natural language utterance into the corresponding sequence of words intended by the speaker. Let X = (x^{(1)}, x^{(2)}, ..., x^{(T)}) denote the sequence of acoustic input vectors (traditionally produced by splitting the audio into 20ms frames). Most speech recognition systems preprocess the input using specialized hand-designed features, but some (Jaitly and Hinton, 2011) deep learning systems learn features from raw input. Let y = (y_1, y_2, ..., y_N) denote the target output sequence (usually a sequence of words or characters). The automatic speech recognition (ASR) task consists of creating a function f*_{ASR} that computes the most probable linguistic sequence y given the acoustic sequence X:

\[
f^{*}_{\mathrm{ASR}}(X) = \arg\max_{y} P^{*}(y \mid X = X) \tag{12.4}
\]
where P* is the true conditional distribution relating the inputs X to the targets y.

Since the 1980s and until about 2009–2012, state-of-the-art speech recognition systems primarily combined hidden Markov models (HMMs) and Gaussian mixture models (GMMs). GMMs modeled the association between acoustic features and phonemes (Bahl et al., 1987), while HMMs modeled the sequence of phonemes. The GMM-HMM model family treats acoustic waveforms as being generated by the following process: first an HMM generates a sequence of phonemes and discrete sub-phonemic states (such as the beginning, middle, and end of each phoneme), then a GMM transforms each discrete symbol into a brief segment of audio waveform. Although GMM-HMM systems dominated ASR until recently, speech recognition was actually one of the first areas where neural networks were applied, and numerous ASR systems from the late 1980s and early 1990s used
neural nets (Bourlard and Wellekens, 1989; Waibel et al., 1989; Robinson and Fallside, 1991; Bengio et al., 1991, 1992; Konig et al., 1996). At the time, the performance of ASR based on neural nets approximately matched the performance of GMM-HMM systems. For example, Robinson and Fallside (1991) achieved 26% phoneme error rate on the TIMIT (Garofolo et al., 1993) corpus (with 39 phonemes to discriminate between), which was better than or comparable to HMM-based systems. Since then, TIMIT has been a benchmark for phoneme recognition, playing a role similar to the role MNIST plays for object recognition. However, because of the complex engineering involved in software systems for speech recognition and the effort that had been invested in building these systems on the basis of GMM-HMMs, the industry did not see a compelling argument
for switching to neural networks. As a consequence, until the late 2000s, both academic and industrial research in using neural nets for speech recognition mostly focused on using neural nets to learn extra features for GMM-HMM systems.

Later, with much larger datasets and much larger and deeper models, recognition accuracy was dramatically improved by using neural networks to replace GMMs for the task of associating acoustic features to phonemes (or sub-phonemic states). Starting in 2009, speech researchers applied a form of deep learning based on unsupervised learning to speech recognition. This approach to deep learning was based on training undirected probabilistic models called restricted Boltzmann machines (RBMs) to model the input data. RBMs will be described in Part III. To solve speech recognition tasks, unsupervised pretraining
was used to build deep feedforward networks whose layers were each initialized by training an RBM. These networks take spectral acoustic representations in a fixed-size input window (around a center frame) and predict the conditional probabilities of HMM states for that center frame. Training such deep networks helped to significantly improve the recognition rate on TIMIT (Mohamed et al., 2009, 2012a), bringing down the phoneme error rate from about 26% to 20.7%. See Mohamed et al. (2012b) for an analysis of reasons for the success of these models. Extensions to the basic phone recognition pipeline included the addition of speaker-adaptive features (Mohamed et al., 2011) that further reduced the error rate. This was quickly followed up by work to expand the architecture from phoneme recognition (which is what TIMIT is focused on) to large-vocabulary
speech recognition (Dahl et al., 2012), which involves not just recognizing phonemes but also recognizing sequences of words from a large vocabulary. Deep networks for speech recognition eventually shifted from being based on pretraining and Boltzmann machines to being based on techniques such as rectified linear units and dropout (Zeiler et al., 2013; Dahl et al., 2013). By that time, several of the major speech groups in industry had started exploring deep learning in collaboration with
academic researchers. Hinton et al. (2012a) describe the breakthroughs achieved by these collaborators, which are now deployed in products such as mobile phones.

Later, as these groups explored larger and larger labeled datasets and incorporated some of the methods for initializing, training, and setting up the architecture of deep nets, they realized that the unsupervised pretraining phase was either unnecessary or did not bring any significant improvement.

These breakthroughs in recognition performance for word error rate in speech recognition were unprecedented (around 30% improvement) and followed a long period of about ten years during which error rates did not improve much with the traditional GMM-HMM technology, in spite of the continuously growing size of training sets (see Fig. 2.4 of Deng and Yu (2014)). This created a rapid shift in the speech recognition community towards deep learning. In a matter of roughly
two years, most of the industrial products for speech recognition incorporated deep neural networks and this success spurred a new wave of research into deep learning algorithms and architectures for ASR, which is still ongoing today.

One of these innovations was the use of convolutional networks (Sainath et al., 2013) that replicate weights across time and frequency, improving over the earlier time-delay neural networks that replicated weights only across time. The new two-dimensional convolutional models regard the input spectrogram not as one long vector but as an image, with one axis corresponding to time and the other to frequency of spectral components.

Another important push, still ongoing, has been towards end-to-end deep learning speech recognition systems that completely remove the HMM.
The first major breakthrough in this direction came from Graves et al. (2013) who trained a deep LSTM RNN (see Sec. 10.10), using MAP inference over the frame-to-phoneme alignment, as in LeCun et al. (1998b) and in the CTC framework (Graves et al., 2006; Graves, 2012). A deep RNN (Graves et al., 2013) has state variables from several layers at each time step, giving the unfolded graph two kinds of depth: ordinary depth due to a stack of layers, and depth due to time unfolding. This work brought the phoneme error rate on TIMIT to a record low of 17.7%. See Pascanu et al. (2014a) and Chung et al. (2014) for other variants of deep RNNs, applied in other settings.

Another contemporary step toward end-to-end deep learning ASR is to let the system learn how to "align" the acoustic-level information with the phonetic-level information (Chorowski et al., 2014; Lu et al., 2015).
12.4
Natural Language Processing
Natural language processing (NLP) is the use of human languages, such as English or French, by a computer. Computer programs typically read and emit specialized languages designed to allow efficient and unambiguous parsing by simple programs. More naturally occurring languages are often ambiguous and defy formal description. Natural language processing includes applications such as machine translation, in which the learner must read a sentence in one human language and emit an equivalent sentence in another human language. Many NLP applications are based on language models that define a probability distribution over sequences of words, characters or bytes in a natural language.

As with the other applications discussed in this chapter, very generic neural network techniques can be successfully applied to natural language processing.
some domain-specific strategies become important. To build an efficient model of natural language, we must usually use techniques that are specialized for processing sequential data. In many cases, we choose to regard natural language as a sequence of words, rather than a sequence of individual characters or bytes. Because the total number of possible words is so large, word-based language models must operate on an extremely high-dimensional and sparse discrete space. Several strategies have been developed to make models of such a space efficient, both in a computational and in a statistical sense.
12.4.1 n-grams

A language model defines a probability distribution over sequences of tokens in
a natural language. Depending on how the model is designed, a token may be a word, a character, or even a byte. Tokens are always discrete entities. The earliest successful language models were based on models of fixed-length sequences of tokens called n-grams. An n-gram is a sequence of n tokens.

Models based on n-grams define the conditional probability of the n-th token given the preceding n − 1 tokens. The model uses products of these conditional distributions to define the probability distribution over longer sequences:

    P(x_1, ..., x_τ) = P(x_1, ..., x_{n−1}) ∏_{t=n}^{τ} P(x_t | x_{t−n+1}, ..., x_{t−1}).    (12.5)
This decomposition is justified by the chain rule of probability. The probability distribution over the initial sequence P(x_1, ..., x_{n−1}) may be modeled by a different model with a smaller value of n.
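The decomposition in Eq. 12.5 can be made concrete with a small sketch. The bigram (n = 2) probability tables below are invented purely for illustration; they are not drawn from any real corpus.

```python
# Sketch of Eq. 12.5 for n = 2 (a bigram model). The toy probability
# tables below are invented for illustration, not estimated from data.

# P(x_1): marginal distribution over the first token.
p_initial = {"THE": 0.6, "A": 0.4}

# P(x_t | x_{t-1}): conditional distribution over the next token.
p_cond = {
    "THE": {"DOG": 0.5, "CAT": 0.5},
    "A":   {"DOG": 0.7, "CAT": 0.3},
    "DOG": {"RAN": 1.0},
    "CAT": {"RAN": 1.0},
}

def sequence_probability(tokens):
    """P(x_1, ..., x_tau) = P(x_1) * prod_t P(x_t | x_{t-1})."""
    prob = p_initial.get(tokens[0], 0.0)
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p_cond.get(prev, {}).get(cur, 0.0)
    return prob

print(sequence_probability(["THE", "DOG", "RAN"]))  # 0.6 * 0.5 * 1.0 = 0.3
```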
CHAPTER 12. APPLICATIONS
Training n-gram models is straightforward because the maximum likelihood estimate can be computed simply by counting how many times each possible n-gram occurs in the training set. Models based on n-grams have been the core building block of statistical language modeling for many decades (Jelinek and Mercer, 1980; Katz, 1987; Chen and Goodman, 1999).

For small values of n, models have particular names: unigram for n=1, bigram for n=2, and trigram for n=3. These names derive from the Latin prefixes for the corresponding numbers and the Greek suffix "-gram" denoting something that is written.

Usually we train both an n-gram model and an n−1 gram model simultaneously. This makes it easy to compute

    P(x_t | x_{t−n+1}, ..., x_{t−1}) = P_n(x_{t−n+1}, ..., x_t) / P_{n−1}(x_{t−n+1}, ..., x_{t−1})    (12.6)

simply by looking up two stored probabilities.
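A minimal sketch of this counting procedure and of the lookup in Eq. 12.6, on an invented toy corpus. As the text notes, the (n−1)-gram model is trained with the final token of each sequence omitted, so both models share the same normalizer and it cancels in the ratio.

```python
from collections import Counter

# Maximum likelihood trigram estimation by counting, on a tiny invented
# corpus, followed by the lookup of Eq. 12.6. P_2 is trained with the
# final token omitted so the normalizers of P_3 and P_2 cancel.
corpus = "THE DOG RAN AWAY . THE DOG RAN HOME . THE CAT RAN AWAY .".split()

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tri = ngram_counts(corpus, 3)       # counts for P_3
bi = ngram_counts(corpus[:-1], 2)   # counts for P_2, final token omitted
norm = len(corpus) - 2              # same number of trigrams and bigrams

def p3(*w):  # stored trigram probability
    return tri[w] / norm

def p2(*w):  # stored bigram probability
    return bi[w] / norm

# Eq. 12.6: P(AWAY | DOG RAN) = P_3(DOG RAN AWAY) / P_2(DOG RAN)
print(p3("DOG", "RAN", "AWAY") / p2("DOG", "RAN"))  # 0.5
```

"DOG RAN" occurs twice in the toy corpus and is followed by AWAY once, so the ratio of stored probabilities recovers the conditional probability 0.5.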
For this to exactly reproduce inference in P_n, we must omit the final character from each sequence when we train P_{n−1}.

As an example, we demonstrate how a trigram model computes the probability of the sentence "THE DOG RAN AWAY." The first words of the sentence cannot be handled by the default formula based on conditional probability because there is no context at the beginning of the sentence. Instead, we must use the marginal probability over words at the start of the sentence. We thus evaluate P_3(THE DOG RAN). Finally, the last word may be predicted using the typical case of using the conditional distribution P(AWAY | DOG RAN). Putting this together with Eq. 12.6, we obtain:

    P(THE DOG RAN AWAY) = P_3(THE DOG RAN) P_3(DOG RAN AWAY) / P_2(DOG RAN).    (12.7)

A fundamental limitation of maximum likelihood for n-gram models is that P_n as estimated from training set
counts is very likely to be zero in many cases, even though the tuple (x_{t−n+1}, ..., x_t) may appear in the test set. This can cause two different kinds of catastrophic outcomes. When P_{n−1} is zero, the ratio is undefined, so the model does not even produce a sensible output. When P_{n−1} is non-zero but P_n is zero, the test log-likelihood is −∞. To avoid such catastrophic outcomes, most n-gram models employ some form of smoothing. Smoothing techniques shift probability mass from the observed tuples to unobserved ones that are similar. See Chen and Goodman (1999) for a review and empirical comparisons. One basic technique consists of adding non-zero probability mass to all of the possible next symbol values. This method can be justified as Bayesian inference with a uniform
or Dirichlet prior over the count parameters. Another very popular idea is to form a mixture model containing higher-order and lower-order n-gram models, with the higher-order models providing more capacity and the lower-order models being more likely to avoid counts of zero. Back-off methods look up the lower-order n-grams if the frequency of the context x_{t−1}, ..., x_{t−n+1} is too small to use the higher-order model. More formally, they estimate the distribution over x_t by using contexts x_{t−n+k}, ..., x_{t−1}, for increasing k, until a sufficiently reliable estimate is found.

Classical n-gram models are particularly vulnerable to the curse of dimensionality. There are |V|^n possible n-grams and |V| is often very large. Even with a massive training set and modest n, most n-grams will not occur in the training set. One way to view a classical n-gram model is that it is performing nearest-neighbor
lookup. In other words, it can be viewed as a local non-parametric predictor, similar to k-nearest neighbors. The statistical problems facing these extremely local predictors are described in Sec. 5.11.2. The problem for a language model is even more severe than usual, because any two different words have the same distance from each other in one-hot vector space. It is thus difficult to leverage much information from any "neighbors": only training examples that literally repeat the same context are useful for local generalization. To overcome these problems, a language model must be able to share knowledge between one word and other semantically similar words.

To improve the statistical efficiency of n-gram models, class-based language models (Brown et al., 1992; Ney and Kneser, 1993; Niesler et al., 1998) introduce
the notion of word categories and then share statistical strength between words that are in the same category. The idea is to use a clustering algorithm to partition the set of words into clusters or classes, based on their co-occurrence frequencies with other words. The model can then use word class IDs rather than individual word IDs to represent the context on the right side of the conditioning bar. Composite models combining word-based and class-based models via mixing or back-off are also possible. Although word classes provide a way to generalize between sequences in which some word is replaced by another of the same class, much information is lost in this representation.
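The simplest smoothing technique mentioned above, adding probability mass to every possible next-symbol value (add-one, or Laplace, smoothing), can be sketched in a few lines. The corpus and function name here are hypothetical toys, not from any real system.

```python
from collections import Counter

# Add-one (Laplace) smoothing for a bigram model: a minimal sketch of
# the "uniform prior" technique from the text, on an invented toy corpus.
corpus = "THE DOG RAN AWAY . THE CAT SAT .".split()
vocab = sorted(set(corpus))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_smoothed(nxt, prev, alpha=1.0):
    """P(nxt | prev) with pseudo-count alpha added to every next-symbol value."""
    return (bigrams[(prev, nxt)] + alpha) / (unigrams[prev] + alpha * len(vocab))

# An unseen pair still gets non-zero mass, avoiding undefined ratios and
# a test log-likelihood of negative infinity.
print(p_smoothed("SAT", "DOG") > 0)                                 # True
print(abs(sum(p_smoothed(w, "THE") for w in vocab) - 1.0) < 1e-9)   # True: normalized
```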
12.4.2 Neural Language Models

Neural language models or NLMs are a class of language model designed to overcome the curse of dimensionality problem for modeling natural language sequences by using a distributed representation of words (Bengio et al., 2001). Unlike class-based n-gram models, neural language models are able to recognize that two words
are similar without losing the ability to encode each word as distinct from the other. Neural language models share statistical strength between one word (and its context) and other similar words and contexts. The distributed representation the model learns for each word enables this sharing by allowing the model to treat words that have features in common similarly. For example, if the word dog and the word cat map to representations that share many attributes, then sentences that contain the word cat can inform the predictions that will be made by the model for sentences that contain the word dog, and vice-versa. Because there are many such attributes, there are many ways in which generalization can happen,
transferring information from each training sentence to an exponentially large number of semantically related sentences. The curse of dimensionality requires the model to generalize to a number of sentences that is exponential in the sentence length. The model counters this curse by relating each training sentence to an exponential number of similar sentences.

We sometimes call these word representations word embeddings. In this interpretation, we view the raw symbols as points in a space of dimension equal to the vocabulary size. The word representations embed those points in a feature space of lower dimension. In the original space, every word is represented by a one-hot vector, so every pair of words is at Euclidean distance √2 from each other. In the embedding space, words that frequently appear in similar contexts (or any pair of words sharing some "features" learned by the model) are close to each other. This often results in words with similar meanings being neighbors. Fig. 12.3 zooms
Fig.(or 12.3 zo zooms oms of on words sharing some learned by the space model) close each tically other. in sp specific ecific areas of a “features” learned word em emb bedding to are show ho how wtoseman semantically This often results withtations similarthat meanings being ors. Fig. 12.3 zooms similar words mapintowords represen representations are close to neigh eac each h bother. in on specific areas of a learned word embedding space to show how semantically Neural net netw works in other domains also define embeddings. For example, a similar words map to representations that are close to each other. hidden lay layer er of a con conv volutional netw network ork provides an “image embedding.” Usually Neural net w orks in other domains also define For example, NLP practitioners are muc much h more in interested terested in thisembeddings. idea of em emb beddings becausea hidden lay er of a do con network provides an “image embedding.” natural language does esvolutional not originally lie in a real-v real-valued alued vector space. TheUsually hidden NLP practitioners are muc h more in terested in this idea of em b eddings b ecause la lay yer has pro provided vided a more qualitativ qualitatively ely dramatic change in the wa way y the data is natural language does not originally lie in a real-valued vector space. The hidden represen represented. ted. layer has provided a more qualitatively dramatic change in the way the data is The basic idea of using distributed represen representations tations to impro improv ve mo models dels for represented. natural language pro processing cessing is not restricted to neural net netw works. It may also be The basic idea of using distributed represen tations to improv models used with graphical mo models dels that hav havee distributed represen representations tations ine the formfor of natural processing is not to neural networks. It may also be m ultiplelanguage latent variables (Mnih andrestricted Hin Hinton ton, 2007 ). 
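The contrast between one-hot vectors and learned embeddings can be seen numerically. The embedding coordinates below are made up for illustration; a trained model would learn such clustering from data rather than have it assigned by hand.

```python
import numpy as np

# One-hot vectors place every pair of distinct words at Euclidean
# distance sqrt(2), so they carry no similarity information; learned
# embeddings can place related words near each other.
vocab = ["cat", "dog", "france", "england"]
one_hot = np.eye(len(vocab))

d = np.linalg.norm(one_hot[0] - one_hot[1])
print(d)  # sqrt(2), identical for every pair of distinct words

# A made-up 2-D embedding where animals cluster together and countries do too.
emb = np.array([[0.9, 0.1],   # cat
                [0.8, 0.2],   # dog
                [0.1, 0.9],   # france
                [0.2, 0.8]])  # england

def euclid(i, j):
    return np.linalg.norm(emb[i] - emb[j])

print(euclid(0, 1) < euclid(0, 2))  # True: cat is nearer dog than france
```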
Figure 12.3: Two-dimensional visualizations of word embeddings obtained from a neural machine translation model (Bahdanau et al., 2015), zooming in on specific areas where semantically related words have embedding vectors that are close to each other. Countries appear on the left and numbers on the right. Keep in mind that these embeddings are 2-D for the purpose of visualization. In real applications, embeddings typically have higher dimensionality and can simultaneously capture many kinds of similarity between words.
12.4.3 High-Dimensional Outputs

In many natural language applications, we often want our models to produce words (rather than characters) as the fundamental unit of the output. For large vocabularies, it can be very computationally expensive to represent an output distribution over the choice of a word, because the vocabulary size is large. In many applications, V contains hundreds of thousands of words. The naive approach to representing such a distribution is to apply an affine transformation from a hidden representation to the output space, then apply the softmax function. Suppose we have a vocabulary V with size |V|. The weight matrix describing the linear component of this affine transformation is very large, because its output dimension is |V|. This imposes a high memory cost to represent the matrix, and a high computational cost to multiply by it.
Because the softmax is normalized across all |V| outputs, it is necessary to perform the full matrix multiplication at training time as well as test time: we cannot calculate only the dot product with the weight vector for the correct output. The high computational costs of the output layer thus arise both at training time (to compute the likelihood and its gradient) and at test time (to compute probabilities for all or selected words). For specialized loss functions, the gradient can be computed efficiently (Vincent et al., 2015), but the standard cross-entropy loss applied to a traditional softmax output layer poses
many difficulties. Suppose that h is the top hidden layer used to predict the output probabilities ŷ. If we parametrize the transformation from h to ŷ with learned weights W and learned biases b, then the affine-softmax output layer performs the following computations:

    a_i = b_i + Σ_j W_{ij} h_j,    ∀i ∈ {1, ..., |V|},    (12.8)

    ŷ_i = e^{a_i} / Σ_{i'=1}^{|V|} e^{a_{i'}}.    (12.9)

If h contains n_h elements then the above operation is O(|V| n_h). With n_h in the thousands and |V| in the hundreds of thousands, this operation dominates the computation of most neural language models.
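Eqs. 12.8 and 12.9 can be written out directly. The sizes below are toy values standing in for realistic ones; with |V| in the hundreds of thousands, the |V| × n_h matrix W would dominate memory and compute.

```python
import numpy as np

# The affine-softmax output layer of Eqs. 12.8-12.9, with invented toy sizes.
rng = np.random.default_rng(0)
n_h, vocab_size = 4, 10

h = rng.standard_normal(n_h)                 # top hidden layer
W = rng.standard_normal((vocab_size, n_h))   # |V| x n_h weight matrix
b = rng.standard_normal(vocab_size)          # |V| biases

a = b + W @ h                        # Eq. 12.8: O(|V| * n_h) multiply-adds
a -= a.max()                         # standard stabilization before exponentiation
y_hat = np.exp(a) / np.exp(a).sum()  # Eq. 12.9: softmax over all |V| outputs

print(abs(y_hat.sum() - 1.0) < 1e-9)  # True: a valid distribution over the vocabulary
```

Note that computing any single ŷ_i still requires the full matrix product, because the softmax denominator sums over every output.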
12.4.3.1 Use of a Short List

The first neural language models (Bengio et al., 2001, 2003) dealt with the high cost of using a softmax over a large number of output words by limiting the vocabulary size to 10,000 or 20,000 words. Schwenk and Gauvain (2002) and Schwenk (2007) built upon this approach by splitting the vocabulary V into a shortlist L of most frequent words (handled by the neural net) and a tail T = V \ L of more rare words (handled by an n-gram model). To be able to combine the two predictions, the neural net also has to predict the probability that a word appearing after context C belongs to the tail list. This may be achieved by adding an extra sigmoid output unit to provide an estimate of P(i ∈ T | C). The extra output can then be used to
achieve an estimate of the probability distribution over all words in V as follows:

    P(y = i | C) = 1_{i∈L} P(y = i | C, i ∈ L)(1 − P(i ∈ T | C))    (12.10)
                 + 1_{i∈T} P(y = i | C, i ∈ T) P(i ∈ T | C),    (12.11)

where P(y = i | C, i ∈ L) is provided by the neural language model and P(y = i | C, i ∈ T) is provided by the n-gram model. With slight modification, this approach can also work using an extra output value in the neural language model's softmax layer, rather than a separate sigmoid unit.

An obvious disadvantage of the short list approach is that the potential generalization advantage of the neural language models is limited to the most frequent words, where, arguably, it is the least useful. This disadvantage has stimulated the exploration of alternative methods to deal with high-dimensional outputs, described below.
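A sketch of how Eqs. 12.10 and 12.11 combine the two predictors. All probability values below are invented placeholders for the outputs of a trained neural net and n-gram model; only the combination rule is the point.

```python
# Combining a neural-net distribution over a shortlist L with an n-gram
# distribution over the tail T, per Eqs. 12.10-12.11. Numbers are invented.
shortlist = {"the": 0.5, "dog": 0.3, "ran": 0.2}   # P(y=i | C, i in L), neural net
tail = {"aardvark": 0.6, "zyzzyva": 0.4}           # P(y=i | C, i in T), n-gram model
p_tail = 0.1                                       # sigmoid estimate of P(i in T | C)

def p_word(word):
    if word in shortlist:
        return shortlist[word] * (1.0 - p_tail)    # Eq. 12.10
    return tail.get(word, 0.0) * p_tail            # Eq. 12.11

total = sum(p_word(w) for w in list(shortlist) + list(tail))
print(abs(total - 1.0) < 1e-9)  # True: the combined model is normalized
```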
12.4.3.2 Hierarchical Softmax

A classical approach (Goodman, 2001) to reducing the computational burden of high-dimensional output layers over large vocabulary sets V is to decompose probabilities hierarchically. Instead of necessitating a number of computations proportional to |V| (and also proportional to the number of hidden units, n_h), the |V| factor can be reduced to as low as log |V|. Bengio (2002) and Morin and Bengio (2005) introduced this factorized approach to the context of neural language models.

One can think of this hierarchy as building categories of words, then categories of categories of words, then categories of categories of categories of words, etc. These nested categories form a tree, with words at the leaves. In a balanced tree, the tree has depth O(log |V|). The probability of choosing a word is given by the product of the probabilities of choosing the branch leading to that word at every
node on a path from the root of the tree to the leaf containing the word. Fig. 12.4 illustrates a simple example. Mnih and Hinton (2009) also describe how to use multiple paths to identify a single word in order to better model words that have multiple meanings. Computing the probability of a word then involves summation over all of the paths that lead to that word.

To predict the conditional probabilities required at each node of the tree, we typically use a logistic regression model at each node of the tree, and provide the same context C as input to all of these models. Because the correct output is encoded in the training set, we can use supervised learning to train the logistic regression models.
This is typically done using a standard cross-entropy loss, corresponding to maximizing the log-likelihood of the correct sequence of decisions.

Because the output log-likelihood can be computed efficiently (as low as log |V| rather than |V|), its gradients may also be computed efficiently. This includes not only the gradient with respect to the output parameters but also the gradients with respect to the hidden layer activations.

It is possible but usually not practical to optimize the tree structure to minimize the expected number of computations. Tools from information theory specify how to choose the optimal binary code given the relative frequencies of the words. To do so, we could structure the tree so that the number of bits associated with a word is approximately equal to the logarithm of the frequency of that word.
However, in practice, the computational savings are typically not worth the effort because the computation of the output probabilities is only one part of the total computation in the neural language model. For example, suppose there are l fully connected hidden layers of width n_h. Let n_b be the weighted average of the number of bits
[Figure 12.4 appears here: a binary tree over eight words w0, ..., w7, with internal nodes labeled by bit-prefixes (0), (1), (0,0), ..., and leaves labeled (0,0,0) through (1,1,1).]

Figure 12.4: Illustration of a simple hierarchy of word categories, with 8 words w0, ..., w7 organized into a three-level hierarchy. The leaves of the tree represent actual specific words. Internal nodes represent groups of words. Any node can be indexed by the sequence of binary decisions (0=left, 1=right) to reach the node from the root. Super-class (0) contains the classes (0, 0) and (0, 1), which respectively contain the sets of words {w0, w1} and {w2, w3}, and similarly super-class (1) contains the classes (1, 0) and (1, 1), which respectively contain the words {w4, w5} and {w6, w7}. If the tree is sufficiently balanced, the maximum depth (number of binary decisions) is on the order of the logarithm of the number of words |V|: the choice of one out of |V| words can be obtained by doing O(log |V|) operations (one for each of the nodes on the path from the root). In this example, computing the probability of a word y can be done by multiplying three probabilities, associated with the binary decisions to move left or right at each node on the path from the root to a node y. Let b_i(y) be the i-th binary decision when traversing the tree towards the value y. The probability of sampling an output y decomposes into a product of conditional probabilities, using the chain rule for conditional probabilities, with each node indexed by the prefix of these bits. For example, node (1, 0) corresponds to the prefix (b0(w4) = 1, b1(w4) = 0), and the probability of w4 can be decomposed as follows:

    P(y = w4) = P(b0 = 1, b1 = 0, b2 = 0)                                  (12.12)
              = P(b0 = 1) P(b1 = 0 | b0 = 1) P(b2 = 0 | b0 = 1, b1 = 0).   (12.13)
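The decomposition in Eqs. 12.12–12.13 can be sketched for the 8-word tree of Fig. 12.4; the per-node logistic-regression weights and the context vector below are random stand-ins, not trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logistic-regression weights for the 7 internal nodes of the
# 8-word tree in Fig. 12.4, indexed by the bit-prefix of the node.
rng = np.random.default_rng(0)
ctx_dim = 5
node_w = {prefix: rng.standard_normal(ctx_dim)
          for prefix in [(), (0,), (1,), (0, 0), (0, 1), (1, 0), (1, 1)]}

def word_prob(bits, context):
    """P(y = word) as a product over the root-to-leaf path (Eqs. 12.12-12.13)."""
    p, prefix = 1.0, ()
    for b in bits:
        p_right = sigmoid(node_w[prefix] @ context)  # P(next bit = 1 | C, prefix)
        p *= p_right if b == 1 else (1.0 - p_right)
        prefix = prefix + (b,)
    return p

context = rng.standard_normal(ctx_dim)
# w4 is the leaf with binary code (1, 0, 0), as in Eq. 12.13.
p_w4 = word_prob((1, 0, 0), context)
# The probabilities over all 8 leaves sum to 1 by construction.
leaves = [(b0, b1, b2) for b0 in (0, 1) for b1 in (0, 1) for b2 in (0, 1)]
total = sum(word_prob(bits, context) for bits in leaves)
```

Only three sigmoid evaluations are needed per word, rather than an 8-way softmax; this is the log |V| versus |V| saving the text describes.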
required to identify a word, with the weighting given by the frequency of these words. In this example, the number of operations needed to compute the hidden activations grows as O(l n_h^2) while the output computations grow as O(n_b n_h). As long as n_b ≤ l n_h, we can reduce computation more by shrinking n_h than by shrinking n_b. Indeed, n_b is often small. Because the size of the vocabulary rarely exceeds a million words and log_2(10^6) ≈ 20, it is possible to reduce n_b to about 20, but n_h is often much larger, around 10^3 or more. Rather than carefully optimizing a tree with a branching factor of 2, one can instead define a tree with depth two and a branching factor of √|V|. Such a tree corresponds to simply defining a set of mutually exclusive word classes. The simple approach based on a tree of depth two captures most of the computational benefit of the hierarchical strategy.

One question that remains somewhat open is how to best define these word
classes, or how to define the word hierarchy in general. Early work used existing hierarchies (Morin and Bengio, 2005) but the hierarchy can also be learned, ideally jointly with the neural language model. Learning the hierarchy is difficult. An exact optimization of the log-likelihood appears intractable because the choice of a word hierarchy is a discrete one, not amenable to gradient-based optimization. However, one could use discrete optimization to approximately optimize the partition of words into word classes.

An important advantage of the hierarchical softmax is that it brings computational benefits both at training time and at test time, if at test time we want to compute the probability of specific words.

Of course, computing the probability of all |V| words will remain expensive even with the hierarchical softmax.
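The depth-two variant discussed above, where the tree reduces to a set of mutually exclusive word classes, can be sketched as a class-then-word factorization; the class assignments and scores here are hypothetical stand-ins:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Depth-two tree: P(y | C) = P(class(y) | C) * P(y | class(y), C).
rng = np.random.default_rng(0)
V, n_classes = 16, 4                       # branching factor sqrt(|V|) = 4
words_per_class = V // n_classes
word_class = np.repeat(np.arange(n_classes), words_per_class)

p_class = softmax(rng.standard_normal(n_classes))            # P(class | C)
p_word_in_class = [softmax(rng.standard_normal(words_per_class))
                   for _ in range(n_classes)]                # P(word | class, C)

def word_prob(i):
    c = word_class[i]
    pos = i - c * words_per_class          # index of word i within its class
    return p_class[c] * p_word_in_class[c][pos]

total = sum(word_prob(i) for i in range(V))
```

Each word probability requires one softmax over n_classes entries and one over words_per_class entries, i.e. about 2√|V| scores instead of |V|.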
Another important operation is selecting the most likely word in a given context. Unfortunately the tree structure does not provide an efficient and exact solution to this problem.

A disadvantage is that in practice the hierarchical softmax tends to give worse test results than the sampling-based methods we will describe next. This may be due to a poor choice of word classes.

12.4.3.3  Importance Sampling

One way to speed up the training of neural language models is to avoid explicitly computing the contribution of the gradient from all of the words that do not appear in the next position. Every incorrect word should have low probability under the model. It can be computationally costly to enumerate all of these words. Instead, it is possible to sample only a subset of the words. Using the notation introduced
in Eq. 12.8, the gradient can be written as follows:

    ∂ log P(y | C) / ∂θ = ∂ log softmax_y(a) / ∂θ                        (12.14)
                        = ∂/∂θ log [ e^{a_y} / Σ_i e^{a_i} ]             (12.15)
                        = ∂/∂θ ( a_y − log Σ_i e^{a_i} )                 (12.16)
                        = ∂a_y/∂θ − Σ_i P(y = i | C) ∂a_i/∂θ,            (12.17)

where a is the vector of pre-softmax activations (or scores), with one element per word. The first term is the positive phase term (pushing a_y up) while the second term is the negative phase term (pushing a_i down for all i, with weight P(i | C)). Since the negative phase term is an expectation, we can estimate it with a Monte Carlo sample. However, that would require sampling from the model itself. Sampling from the model requires computing P(i | C) for all i in the vocabulary, which is precisely what we are trying to avoid.
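The identity in Eq. 12.17 can be checked numerically: the gradient of log softmax_y(a) with respect to a_i is 1_{i=y} − P(y = i | C), which a finite-difference test confirms:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
a = rng.standard_normal(6)     # pre-softmax scores, one per word
y = 2                          # index of the observed word

# Analytic gradient from Eq. 12.17: one-hot positive phase minus P(i | C).
analytic = -softmax(a)
analytic[y] += 1.0

# Central finite differences of d/da_i log softmax_y(a).
eps = 1e-6
numeric = np.empty_like(a)
for i in range(len(a)):
    ap, am = a.copy(), a.copy()
    ap[i] += eps
    am[i] -= eps
    numeric[i] = (np.log(softmax(ap)[y]) - np.log(softmax(am)[y])) / (2 * eps)
```

The two gradients agree to finite-difference precision; note also that the entries of the analytic gradient sum to zero, since the positive and negative phases carry equal total mass.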
Instead of sampling from the model, one can sample from another distribution, called the proposal distribution (denoted q), and use appropriate weights to correct for the bias introduced by sampling from the wrong distribution (Bengio and Sénécal, 2003; Bengio and Sénécal, 2008). This is an application of a more general technique called importance sampling, which will be described in more detail in Sec. 17.2. Unfortunately, even exact importance sampling is not efficient because it requires computing weights p_i / q_i, where p_i = P(i | C), which can only be computed if all the scores a_i are computed. The solution adopted for this application is called biased importance sampling, where the importance weights are normalized to sum to 1. When negative word n_i is sampled, the associated gradient is weighted by

    w_i = (p_{n_i} / q_{n_i}) / Σ_{j=1}^{N} (p_{n_j} / q_{n_j}).         (12.18)
These weights are used to give the appropriate importance to the m negative samples from q used to form the estimated negative phase contribution to the gradient:

    Σ_{i=1}^{|V|} P(i | C) ∂a_i/∂θ ≈ (1/m) Σ_{i=1}^{m} w_i ∂a_{n_i}/∂θ.  (12.19)

A unigram or a bigram distribution works well as the proposal distribution q. It is easy to estimate the parameters of such a distribution from data. After estimating the parameters, it is also possible to sample from such a distribution very efficiently.
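A sketch of the biased (self-normalized) importance-sampling estimate of the negative phase, in the spirit of Eqs. 12.18–12.19; the model and proposal distributions below are random stand-ins, and because the weights are normalized to sum to 1 they are applied directly to the sampled gradient terms:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 1000
p = rng.random(V)
p /= p.sum()                               # stand-in for model probabilities P(i | C)
q = 0.5 + rng.random(V)
q /= q.sum()                               # unigram-style proposal, bounded away from 0
grad_a = rng.standard_normal(V)            # stand-in for da_i/dtheta, one scalar per word

exact = p @ grad_a                         # negative phase: sum_i P(i | C) da_i/dtheta

m = 100_000
n = rng.choice(V, size=m, p=q)             # m negative words sampled from q
w = p[n] / q[n]
w /= w.sum()                               # normalized importance weights (Eq. 12.18)
estimate = w @ grad_a[n]                   # self-normalized estimate of the exact sum
```

In a real model one would only need the scores a_{n_i} of the sampled words, not the full softmax; here `p` is known exactly only so the estimate can be compared against the true negative-phase sum.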
Importance sampling is not only useful for speeding up models with large softmax outputs. More generally, it is useful for accelerating training with large sparse output layers, where the output is a sparse vector rather than a 1-of-n choice. An example is a bag of words. A bag of words is a sparse vector v where v_i indicates the presence or absence of word i from the vocabulary in the document. Alternately, v_i can indicate the number of times that word i appears. Machine learning models that emit such sparse vectors can be expensive to train for a variety of reasons. Early in learning, the model may not actually choose to make the output truly sparse. Moreover, the loss function we use for training might most naturally be described in terms of comparing every element of the output to every element of the target.
This means that it is not always clear that there is a computational benefit to using sparse outputs, because the model may choose to make the majority of the output non-zero, and all of these non-zero values need to be compared to the corresponding training target, even if the training target is zero. Dauphin et al. (2011) demonstrated that such models can be accelerated using importance sampling. The efficient algorithm minimizes the reconstruction loss for the "positive words" (those that are non-zero in the target) and an equal number of "negative words." The negative words are chosen randomly, using a heuristic to sample words that are more likely to be mistaken. The bias introduced by this heuristic oversampling can then be corrected using importance weights.

In all of these cases, the computational complexity of gradient estimation for the output layer is reduced to be proportional to the number of negative samples rather than proportional to the size of the output vector.
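The positive/negative word scheme described above can be sketched as follows; the uniform negative sampling here is a placeholder for the heuristic (and its importance-weight correction) used by Dauphin et al. (2011):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 20
target = np.zeros(V)
target[[2, 5, 11]] = 1.0                  # bag-of-words target with 3 "positive" words

positives = np.flatnonzero(target)        # words present in the document
# Sample an equal number of "negative" words from the rest of the vocabulary;
# a real heuristic would oversample words likely to be confused with positives.
candidates = np.setdiff1d(np.arange(V), positives)
negatives = rng.choice(candidates, size=len(positives), replace=False)

output = rng.random(V)                    # stand-in for the model's reconstruction
active = np.concatenate([positives, negatives])
# Reconstruction loss evaluated on 6 entries instead of all 20.
loss = np.mean((output[active] - target[active]) ** 2)
```

The gradient of this loss touches only the rows of the output layer indexed by `active`, which is what makes the cost proportional to the number of sampled words.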
12.4.3.4  Noise-Contrastive Estimation and Ranking Loss

Other approaches based on sampling have been proposed to reduce the computational cost of training neural language models with large vocabularies. An early example is the ranking loss proposed by Collobert and Weston (2008a), which views the output of the neural language model for each word as a score and tries to make the score of the correct word a_y be ranked high in comparison to the other scores a_i. The ranking loss proposed then is

    L = Σ_i max(0, 1 − a_y + a_i).                                       (12.20)
The gradient is zero for the i-th term if the score of the observed word, a_y, is greater than the score of the negative word a_i by a margin of 1. One issue with this criterion is that it does not provide estimated conditional probabilities, which
are useful in some applications, including speech recognition and text generation (including conditional text generation tasks such as translation).

A more recently used training objective for neural language models is noise-contrastive estimation, which is introduced in Sec. 18.6. This approach has been successfully applied to neural language models (Mnih and Teh, 2012; Mnih and Kavukcuoglu, 2013).
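The ranking loss of Eq. 12.20 can be sketched directly; the scores below are hypothetical, and the constant i = y term of the sum is dropped since it does not affect the gradient:

```python
import numpy as np

def ranking_loss(a, y):
    """Eq. 12.20: sum over i != y of max(0, 1 - a_y + a_i)."""
    margins = 1.0 - a[y] + np.delete(a, y)   # one margin per negative word
    return np.maximum(0.0, margins).sum()

a = np.array([2.5, 0.1, -0.3, 1.8])   # hypothetical scores, one per word
# With y = 0, only word 3 violates the margin: 1 - 2.5 + 1.8 = 0.3
loss = ranking_loss(a, y=0)
```

When the correct word beats every other score by more than the margin of 1, the loss (and hence the gradient) is exactly zero, matching the description in the text.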
12.4.4  Combining Neural Language Models with n-grams

A major advantage of n-gram models over neural networks is that n-gram models
achieve high model capacity (by storing the frequencies of very many tuples) while requiring very little computation to process an example (by looking up only a few tuples that match the current context). If we use hash tables or trees to access the counts, the computation used for n-grams is almost independent of capacity. In comparison, doubling a neural network's number of parameters typically also roughly doubles its computation time. Exceptions include models that avoid using all parameters on each pass. Embedding layers index only a single embedding in each pass, so we can increase the vocabulary size without increasing the computation time per example. Some other models, such as tiled convolutional networks, can add parameters while reducing the degree of parameter sharing in order to maintain the same amount of computation.
However, typical neural network layers based on matrix multiplication use an amount of computation proportional to the number of parameters.

One easy way to add capacity is thus to combine both approaches in an ensemble consisting of a neural language model and an n-gram language model (Bengio et al., 2001, 2003). As with any ensemble, this technique can reduce test error if the ensemble members make independent mistakes. The field of ensemble learning provides many ways of combining the ensemble members' predictions, including uniform weighting and weights chosen on a validation set. Mikolov et al. (2011a) extended the ensemble to include not just two models but a large array of models. It is also possible to pair a neural network with a maximum entropy model and train both jointly (Mikolov et al., 2011b). This approach can be viewed as training
This approac approach h can be viewed is also net possible to pair neural ork with a maxim um entrop y model aIt neural netw work with anaextra setnetw of inputs that are connected directly to and the train b oth join tly ( Mikolo v et al. , 2011b ). This approac h can b e viewed as training output, and not connected to an any y other part of the mo model. del. The extra inputs are a neural net w ork with an extra set of inputs that are connected directly the indicators for the presence of particular n-grams in the input con context, text, so to these to any and other part of theThe moincrease del. Theinextra inputs are voutput, ariablesand are vnot ery connected high-dimensional very sparse. mo model del capacity n-grams indicators fornew theppresence of particular in up thetoinput text, so these |sV |n con is huge—the ortion of the architecture contains parameters—but vthe ariables are of very high-dimensional verytosparse. The model capacity amount added computation and needed pro process cess an increase input isin minimal because sV is h uge—the new p ortion of the architecture contains up to parameters—but the extra inputs are very sparse. the amount of added computation needed to process an input | |is minimal because 475 the extra inputs are very sparse.
12.4.5
Neural Machine Translation
Machine translation is the task of reading a sentence in one natural language and emitting a sentence with the equivalent meaning in another language. Machine translation systems often involve many components. At a high level, there is often one component that proposes many candidate translations. Many of these translations will not be grammatical due to differences between the languages.
For directly they yield phrases such as “apple red.” The prop proposal osal mec mechanism hanism suggests example, manyoflanguages put adjectiv es after nouns, so when“red translated man many y variants the suggested translation, ideally including apple.” toAEnglish second directly they phrases suchsystem, as “apple red.” Themo prop mechanism suggests comp componen onen onent t ofyield the translation a language model, del,osal ev evaluates aluates the prop proposed osed man y v ariants of the suggested translation, ideally including “red apple.” A second translations, and can score “red apple” as better than “apple red.” component of the translation system, a language model, evaluates the proposed The earliest of score neural“red net netw works for mac machine hine translation was to upgrade the translations, anduse can apple” as b etter than “apple red.” language mo of a translation system by using a neural language mo model del model del (Sch Schwenk wenk The earliest use of neural net w orks for mac hine translation w as to upgrade had the et al. al.,, 2006; Sc Schw hw hwenk enk, 2010). Previously Previously,, most machine translation systems language del ofmo a del translation by using a neural models del (used Schwenk n-gramlanguage used an nmo -gram model for this system comp componen onen onent. t. The based mo models for et al. , 2006 ; Sc hw enk , 2010 ). Previously , most machine translation systems had mac machine hine translation include not just traditional back-off n-gram mo models dels (Jelinek n n used an -gram mo del for this comp onen t. The -gram based mo dels used for and Mercer, 1980; Katz, 1987; Chen and Goo Goodman dman, 1999) but also maximum macopy hine language translation include not just traditional back-off -gram models (Jelinek entr entropy mo models dels (Berger et al. 
, 1996), in whic which h nan affine-softmax lay layer er and Mercer , 1980 ; Katz , 1987 ; Chen and Goo dman , 1999 ) but also maximum predicts the next word giv given en the presence of frequent n-grams in the context. entropy language models (Berger et al., 1996), in which an affine-softmax layer Traditional language mo models dels rep report ort the probability of in a natural language predicts the next word giv en thesimply presence of frequent n-grams the context. sen sentence. tence. Because mac machine hine translation inv involves olves pro producing ducing an output sen sentence tence giv given en Traditional language moes dels simply report the of a natural language an input sentence, it mak makes sense to extend theprobability natural language mo model del to be sentence. Because machine translation inv,olves ducing anard output sentence givdel en conditional. As describ described ed in Sec. 6.2.1.1 it is pro straightforw straightforward to extend a mo model an input sentence, it mak es sense to extend the natural language mo del to b that defines a marginal distribution ov over er some variable to define a conditionale conditional. oAs ed in Sec. , it isC,straightforw ardt btoe aextend model distribution ver describ that variable given6.2.1.1 a con context text where C migh might single avariable that defines a marginal distribution variable to define a conditional or a list of variables. Devlin et al. (2014ov ) bereatsome the state-of-the-art in some statistical distribution o v er that v ariable given a con text , where migh t b e a single C C mac machine hine translation benchmarks by using an MLP to score a phrase t 1, t2v,ariable . . . , tk or a list of v ariables. Devlin et al. ( 2014 ) b eat the state-of-the-art in some statistical in the target language given a phrase s 1, s2 , . . . , sn in the source language. The machine translation by using an MLP to score a phrase t , t , . . . , t P (t1,btenchmarks MLP estimates 2 , . . . , tk | s1, s2 , . 
. . , s n). The estimate formed by this MLP in the target languageprovided given a bphrase s , s , . .n.-gram , s in mo thedels. source language. The replaces the estimate y conditional models. MLP estimates P (t , t , . . . , t s , s , . . . , s ). The estimate formed by this MLP A dra drawbac wbac wback k of the provided MLP-based approac approach h is that it requires the sequences to be replaces the estimate |by conditional n-gram models. prepro preprocessed cessed to be of fixed length. To mak makee the translation more flexible, we would A dra wbac k of the MLP-based approac h isvariable that it requires the sequences to be lik likee to use a mo model del that can accommo accommodate date length inputs and variable preprocessed to bAn e ofRNN fixedpro length. o mak e the. translation flexible, we would length outputs. provides videsTthis ability ability. Sec. 10.2.4 more describ describes es several wa ways ys lik e to use a mo del that can accommo date v ariable length inputs and v ariable of constructing an RNN that represents a conditional distribution ov over er a sequence length outputs. An RNN pro vides this ability . Sec. 10.2.4 describ es several ways giv given en some input, and Sec. 10.4 describ describes es ho how w to accomplish this conditioning of constructing ana RNN that represents conditional distribution er a sequence sequence when the input is sequence. In all cases,a one mo model del first reads the ov input givenemits someainput, and Sec. that 10.4 describ es how to input accomplish this conditioning and data structure summarizes the sequence. We call this when the input is a sequence. In all cases, one model first reads the input sequence 476 and emits a data structure that summarizes the input sequence. We call this
CHAPTER 12. APPLICATIONS
[Figure 12.5 diagram: Source object (French sentence or image) → Encoder → Intermediate, semantic representation → Decoder → Output object (English sentence)]
Figure 12.5: The encoder-decoder architecture to map back and forth between a surface representation (such as a sequence of words or an image) and a semantic representation. By using the output of an encoder of data from one modality (such as the encoder mapping from French sentences to hidden representations capturing the meaning of sentences) as the input to a decoder for another modality (such as the decoder mapping from hidden representations capturing the meaning of sentences to English), we can train systems that translate from one modality to another. This idea has been applied successfully not just to machine translation but also to caption generation from images.
summary the “context” C. The context C may be a list of vectors, or it may be a vector or tensor. The model that reads the input to produce C may be an RNN (Cho et al., 2014a; Sutskever et al., 2014; Jean et al., 2014) or a convolutional network (Kalchbrenner and Blunsom, 2013). A second model, usually an RNN, then reads the context C and generates a sentence in the target language. This general idea of an encoder-decoder framework for machine translation is illustrated in Fig. 12.5.

In order to generate an entire sentence conditioned on the source sentence, the model must have a way to represent the entire source sentence. Earlier models were only able to represent individual words or phrases. From a representation learning point of view, it can be useful to learn a representation in which sentences that have the same meaning have similar representations regardless of whether they were written in the source language or the target language. This strategy was explored first using a combination of convolutions and RNNs (Kalchbrenner and Blunsom, 2013). Later work introduced the use of an RNN for scoring proposed translations (Cho et al., 2014a) and for generating translated sentences (Sutskever et al., 2014). Jean et al. (2014) scaled these models to larger vocabularies.
12.4.5.1 Using an Attention Mechanism and Aligning Pieces of Data
[Figure 12.6 diagram: the context c is the weighted sum α(t−1) h(t−1) + α(t) h(t) + α(t+1) h(t+1)]
Figure 12.6: A modern attention mechanism, as introduced by Bahdanau et al. (2015), is essentially a weighted average. A context vector c is formed by taking a weighted average of feature vectors h(t) with weights α(t). In some applications, the feature vectors h(t) are hidden units of a neural network, but they may also be raw input to the model. The weights α(t) are produced by the model itself. They are usually values in the interval [0, 1] and are intended to concentrate around just one h(t) so that the weighted average approximates reading that one specific time step precisely. The weights α(t) are usually produced by applying a softmax function to relevance scores emitted by another portion of the model. The attention mechanism is more expensive computationally than directly indexing the desired h(t), but direct indexing cannot be trained with gradient descent. The attention mechanism based on weighted averages is a smooth, differentiable approximation that can be trained with existing optimization algorithms.

Using a fixed-size representation to capture all the semantic details of a very long sentence of say 60 words is very difficult. It can be achieved by training a sufficiently large RNN well enough and for long enough, as demonstrated by Cho et al. (2014a) and Sutskever et al. (2014). However, a more efficient approach is to read the whole sentence or paragraph (to get the context and the gist of what is being expressed), then produce the translated words one at a time, each time focusing on a different part of the input sentence in order to gather the semantic details that are required to produce the next output word. That is exactly the idea that Bahdanau et al. (2015) first introduced. The attention mechanism used to focus on specific parts of the input sequence at each time step is illustrated in Fig. 12.6.

We can think of an attention-based system as having three components:
1. A process that “reads” raw data (such as source words in a source sentence), and converts them into distributed representations, with one feature vector associated with each word position.

2. A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.

3. A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight).

The third component generates the translated sentence.

When words in a sentence written in one language are aligned with corresponding words in a translated sentence in another language, it becomes possible to relate the corresponding word embeddings. Earlier work showed that one could learn a kind of translation matrix relating the word embeddings in one language with the word embeddings in another (Kočiský et al., 2014), yielding lower alignment error rates than traditional approaches based on the frequency counts in the phrase table. There is even earlier work on learning cross-lingual word vectors (Klementiev et al., 2012). Many extensions to this approach are possible. For example, more efficient cross-lingual alignment (Gouws et al., 2014) allows training on larger datasets.
12.4.6 Historical Perspective
The idea of distributed representations for symbols was introduced by Rumelhart et al. (1986a) in one of the first explorations of back-propagation, with symbols corresponding to the identity of family members and the neural network capturing the relationships between family members, with training examples forming triplets such as (Colin, Mother, Victoria). The first layer of the neural network learned a representation of each family member. For example, the features for Colin might represent which family tree Colin was in, what branch of that tree he was in, what generation he was from, etc. One can think of the neural network as computing learned rules relating these attributes together in order to obtain the desired predictions. The model can then make predictions such as inferring who is the mother of Colin.

The idea of forming an embedding for a symbol was extended to the idea of an embedding for a word by Deerwester et al. (1990).
These embeddings were learned using the SVD. Later, embeddings would be learned by neural networks.
The history of natural language processing is marked by transitions in the popularity of different ways of representing the input to the model. Following this early work on symbols or words, some of the earliest applications of neural networks to NLP (Miikkulainen and Dyer, 1991; Schmidhuber, 1996) represented the input as a sequence of characters.

Bengio et al. (2001) returned the focus to modeling words and introduced neural language models, which produce interpretable word embeddings. These neural models have scaled up from defining representations of a small set of symbols in the 1980s to millions of words (including proper nouns and misspellings) in modern applications. This computational scaling effort led to the invention of the techniques described above in Sec. 12.4.3.

Initially, the use of words as the fundamental units of language models yielded
improved language modeling performance (Bengio et al., 2001). To this day, new techniques continually push both character-based models (Sutskever et al., 2011) and word-based models forward, with recent work (Gillick et al., 2015) even modeling individual bytes of Unicode characters.

The ideas behind neural language models have been extended into several natural language processing applications, such as parsing (Henderson, 2003, 2004; Collobert, 2011), part-of-speech tagging, semantic role labeling, chunking, etc, sometimes using a single multi-task learning architecture (Collobert and Weston, 2008a; Collobert et al., 2011a) in which the word embeddings are shared across tasks.

Two-dimensional visualizations of embeddings became a popular tool for analyzing language models following the development of the t-SNE dimensionality reduction algorithm (van der Maaten and Hinton, 2008) and its high-profile application to visualizing word embeddings by Joseph Turian in 2009.
12.5 Other Applications
In this section we cover a few other types of applications of deep learning that are different from the standard object recognition, speech recognition and natural language processing tasks discussed above. Part III of this book will expand that scope even further to include tasks requiring the ability to generate rich high-dimensional samples (unlike “the next word” in language models).
12.5.1 Recommender Systems
One of the major families of applications of machine learning in the information technology sector is the ability to make recommendations of items to potential users or customers. Two major types of applications can be distinguished: online advertising and item recommendations (often these recommendations are still for the purpose of selling a product). Both rely on predicting the association between a user and an item, either to predict the probability of some action (the user buying the product, or some proxy for this action) or the expected gain (which may depend on the value of the product) if an ad is shown or a recommendation is made regarding that product to that user. The internet is currently financed in great part by various forms of online advertising. There are major parts of the economy that rely on online shopping. Companies including Amazon and eBay use machine learning, including deep learning, for their product recommendations.
Sometimes, the items are not products that are actually for sale. Examples include selecting posts to display on social network news feeds, recommending movies to watch, recommending jokes, recommending advice from experts, matching players for video games, or matching people in dating services.

Often, this association problem is handled like a supervised learning problem: given some information about the item and about the user, predict the proxy of interest (user clicks on ad, user enters a rating, user clicks on a “like” button, user buys product, user spends some amount of money on the product, user spends time visiting a page for the product, etc). This often ends up being either a regression problem (predicting some conditional expected value) or a probabilistic classification problem (predicting the conditional probability of some discrete event).

The early work on recommender systems relied on minimal information as inputs for these predictions: the user ID and the item ID. In this context, the only way to generalize is to rely on the similarity between the patterns of values of the target variable for different users or for different items. Suppose that user 1 and user 2 both like items A, B and C. From this, we may infer that user 1 and user 2 have similar tastes. If user 1 likes item D, then this should be a strong cue that user 2 will also like D. Algorithms based on this principle come under the name of collaborative filtering. Both non-parametric approaches (such as nearest-neighbor methods based on the estimated similarity between patterns of preferences) and parametric methods are possible.
Parametric methods often rely on learning a distributed representation (also called an embedding) for each user and for each item. Bilinear prediction of the target variable (such as a rating) is a simple parametric method that is highly successful and often found as a component of
state-of-the-art systems. The prediction is obtained by the dot product between the user embedding and the item embedding (possibly corrected by constants that depend only on either the user ID or the item ID). Let R̂ be the matrix containing our predictions, A a matrix with user embeddings in its rows and B a matrix with item embeddings in its columns. Let b and c be vectors that contain respectively a kind of bias for each user (representing how grumpy or positive that user is in general) and for each item (representing its general popularity). The bilinear prediction is thus obtained as follows:

    R̂_{u,i} = b_u + c_i + Σ_j A_{u,j} B_{j,i}.        (12.21)

Typically one wants to minimize the squared error between predicted ratings R̂_{u,i} and actual ratings R_{u,i}. User embeddings and item embeddings can then be conveniently visualized when they are first reduced to a low dimension (two or three), or they can be used to compare users or items against each other, just like word embeddings. One way to obtain these embeddings is by performing a singular value decomposition of the matrix R of actual targets (such as ratings). This corresponds to factorizing R = UDV′ (or a normalized variant) into the product of two factors, the lower rank matrices A = UD and B = V′. One problem with the SVD is that it treats the missing entries in an arbitrary way, as if they corresponded to a target value of 0. Instead we would like to avoid paying any cost for the predictions made on missing entries.
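A minimal NumPy sketch of Eq. 12.21 trained by gradient descent on the observed entries only. The matrix sizes, learning rate, iteration count, and the convention of storing missing ratings as zero while masking them out of the loss are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 6, 5, 2             # illustrative sizes

# Hypothetical ratings in {1,...,5}; roughly 70% of entries are observed.
R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
mask = rng.random((n_users, n_items)) < 0.7
R = R * mask                              # unobserved entries stored as 0

A = 0.1 * rng.normal(size=(n_users, k))   # user embeddings (rows of A)
B = 0.1 * rng.normal(size=(k, n_items))   # item embeddings (columns of B)
b = np.zeros(n_users)                     # per-user bias
c = np.zeros(n_items)                     # per-item bias

def sq_loss():
    R_hat = b[:, None] + c[None, :] + A @ B          # Eq. 12.21
    return ((mask * (R_hat - R)) ** 2).sum()         # observed entries only

init_loss = sq_loss()
lr = 0.05
for _ in range(500):
    err = mask * (b[:, None] + c[None, :] + A @ B - R)  # zero on missing entries
    gA, gB = err @ B.T, A.T @ err
    A -= lr * gA
    B -= lr * gB
    b -= lr * err.sum(axis=1)
    c -= lr * err.sum(axis=0)
final_loss = sq_loss()
```

Because `err` is zeroed wherever `mask` is false, the missing entries contribute neither to the loss nor to any gradient, which is exactly the property the plain SVD lacks.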
Instead we would like to avoid paying any cost for the predictions made on missing entries. Fortunately, the sum of squared errors on the observed ratings can also be easily minimized by gradient-based optimization. The SVD and the bilinear prediction of Eq. 12.21 both performed very well in the competition for the Netflix prize (Bennett and Lanning, 2007), aiming at predicting ratings for films, based on previous ratings by a large set of anonymous users. Many machine learning experts participated in this competition, which took place between 2006 and 2009. It raised the level of research in recommender systems using advanced machine learning and yielded improvements in recommender systems. Even though it did not win by itself, the simple bilinear prediction or SVD was a component of the ensemble models
presented by most of the competitors, including the winners (Töscher et al., 2009; Koren, 2009).

Beyond these bilinear models with distributed representations, one of the first uses of neural networks for collaborative filtering is based on the RBM undirected probabilistic model (Salakhutdinov et al., 2007). RBMs were an important element of the ensemble of methods that won the Netflix competition (Töscher et al., 2009; Koren, 2009). More advanced variants on the idea of factorizing the ratings matrix have also been explored in the neural networks community (Salakhutdinov and
CHAPTER 12. APPLICATIONS
Mnih, 2008).

However, there is a basic limitation of collaborative filtering systems: when a new item or a new user is introduced, its lack of rating history means that there is no way to evaluate its similarity with other items or users (respectively), or the degree of association between, say, that new user and existing items. This is called the problem of cold-start recommendations. A general way of solving the cold-start recommendation problem is to introduce extra information about the individual users and items. For example, this extra information could be user profile information or features of each item. Systems that use such information are called content-based recommender systems. The mapping from a rich set of user
features or item features to an embedding can be learned through a deep learning architecture (Huang et al., 2013; Elkahky et al., 2015).

Specialized deep learning architectures such as convolutional networks have also been applied to learn to extract features from rich content such as from musical audio tracks, for music recommendation (van den Oörd et al., 2013). In that work, the convolutional net takes acoustic features as input and computes an embedding for the associated song. The dot product between this song embedding and the embedding for a user is then used to predict whether a user will listen to the song.

12.5.1.1 Exploration Versus Exploitation

When making recommendations to users, an issue arises that goes beyond ordinary supervised learning and into the realm of reinforcement learning. Many recommendation problems are most accurately described theoretically as contextual bandits (Langford and Zhang, 2008; Lu et al., 2010).
The issue is that when we use the recommendation system to collect data, we get a biased and incomplete view of the preferences of users: we only see the responses of users to the items they were recommended and not to the other items. In addition, in some cases we may not get any information on users for whom no recommendation has been made (for example, with ad auctions, it may be that the price proposed for an ad was below a minimum price threshold, or does not win the auction, so the ad is not shown at all). More importantly, we get no information about what outcome would have resulted from recommending any of the other items.
This would be like training a classifier by picking one class ŷ for each training example x (typically the class with the highest probability according to the model) and then only getting as feedback whether this was the correct class or not. Clearly, each example conveys less information than in the supervised case where the true label is directly accessible, so more examples are necessary. Worse, if we are not careful, we could end up with a system that continues picking the wrong decisions even as more
and more data is collected, because the correct decision initially had a very low probability: until the learner picks that correct decision, it does not learn about the correct decision. This is similar to the situation in reinforcement learning where only the reward for the selected action is observed. In general, reinforcement learning can involve a sequence of many actions and many rewards. The bandits scenario is a special case of reinforcement learning, in which the learner takes only a single action and receives a single reward. The bandit problem is easier in the sense that the learner knows which reward is associated with which action. In the general reinforcement learning scenario, a high reward or a low reward might have been caused by a recent action or by an action in the distant past. The term contextual bandits refers to the case where the action is taken in the context of some input variable that can inform the decision.
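One of the simplest ways to act in a contextual bandit while still gathering data is epsilon-greedy action selection: with a small probability take a random action, otherwise take the action the current reward estimates favor for the observed context. The sketch below is purely illustrative; the contexts, items, reward values and epsilon are all made up:

```python
import random

random.seed(0)

# Made-up deterministic rewards per (context, action); unknown to the learner.
true_reward = {("user_a", "item_1"): 1.0, ("user_a", "item_2"): 2.0,
               ("user_b", "item_1"): 2.0, ("user_b", "item_2"): 0.0}
actions = ["item_1", "item_2"]
epsilon = 0.1

# Running average of observed reward for each (context, action) pair.
estimates = {k: 0.0 for k in true_reward}
counts = {k: 0 for k in true_reward}

for _ in range(2000):
    context = random.choice(["user_a", "user_b"])
    if random.random() < epsilon:
        action = random.choice(actions)                               # explore
    else:
        action = max(actions, key=lambda a: estimates[(context, a)])  # exploit
    r = true_reward[(context, action)]
    key = (context, action)
    counts[key] += 1
    estimates[key] += (r - estimates[key]) / counts[key]  # incremental mean

# After enough exploration, the greedy policy picks the better item per context.
policy = {c: max(actions, key=lambda a: estimates[(c, a)])
          for c in ["user_a", "user_b"]}
assert policy == {"user_a": "item_2", "user_b": "item_1"}
```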
For example, we at least know the user identity, and we want to pick an item. The mapping from context to action is also called a policy. The feedback loop between the learner and the data distribution (which now depends on the actions of the learner) is a central research issue in the reinforcement learning and bandits literature.

Reinforcement learning requires choosing a tradeoff between exploration and exploitation. Exploitation refers to taking actions that come from the current, best version of the learned policy: actions that we know will achieve a high reward. Exploration refers to taking actions specifically in order to obtain more training data. If we know that given context x, action a gives us a reward of 1, we do not know whether that is the best possible reward. We may want to exploit our current
policy and continue taking action a in order to be relatively sure of obtaining a reward of 1. However, we may also want to explore by trying action a′. We do not know what will happen if we try action a′. We hope to get a reward of 2, but we run the risk of getting a reward of 0. Either way, we at least gain some knowledge. Exploration can be implemented in many ways, ranging from occasionally taking random actions intended to cover the entire space of possible actions, to model-based approaches that compute a choice of action based on its expected reward and the model's amount of uncertainty about that reward.

Many factors determine the extent to which we prefer exploration or exploitation. One of the most prominent factors is the time scale we are interested in. If the
agent has only a short amount of time to accrue reward, then we prefer more exploitation. If the agent has a long time to accrue reward, then we begin with more exploration so that future actions can be planned more effectively with more knowledge. As time progresses and our learned policy improves, we move toward more exploitation.

Supervised learning has no tradeoff between exploration and exploitation
because the supervision signal always specifies which output is correct for each input. There is no need to try out different outputs to determine if one is better than the model's current output: we always know that the label is the best output.

Another difficulty arising in the context of reinforcement learning, besides the exploration-exploitation trade-off, is the difficulty of evaluating and comparing different policies. Reinforcement learning involves interaction between the learner and the environment. This feedback loop means that it is not straightforward to evaluate the learner's performance using a fixed set of test set input values. The policy itself determines which inputs will be seen. Dudik et al. (2011) present techniques for evaluating contextual bandits.
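One standard idea in this off-policy evaluation literature (a simpler building block than the specific techniques cited above) is inverse propensity scoring: reweight each logged reward by the probability with which the logging policy chose the logged action. The sketch below evaluates a hypothetical deterministic target policy on synthetic logs; the contexts, actions, rewards and propensities are all invented for illustration:

```python
import random

random.seed(1)

actions = ["a1", "a2"]

# Made-up expected rewards for each (context, action) pair.
reward_table = {("x1", "a1"): 0.2, ("x1", "a2"): 0.8,
                ("x2", "a1"): 0.6, ("x2", "a2"): 0.1}

# Logging policy: uniform over the two actions, so every propensity is 0.5.
logs = []
for _ in range(20000):
    x = random.choice(["x1", "x2"])
    a = random.choice(actions)
    logs.append((x, a, 0.5, reward_table[(x, a)]))  # (context, action, p, r)

# Deterministic target policy we wish to evaluate offline.
def target_policy(x):
    return "a2" if x == "x1" else "a1"

# Inverse propensity scoring estimate of the target policy's value:
# average of r / p over log entries where the target agrees with the log.
ips = sum(r / p for (x, a, p, r) in logs if target_policy(x) == a) / len(logs)

# True value is 0.5 * 0.8 + 0.5 * 0.6 = 0.7 under uniform contexts.
assert abs(ips - 0.7) < 0.05
```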
12.5.2 Knowledge Representation, Reasoning and Question Answering

Deep learning approaches have been very successful in language modeling, machine
translation and natural language processing due to the use of embeddings for symbols (Rumelhart et al., 1986a) and words (Deerwester et al., 1990; Bengio et al., 2001). These embeddings represent semantic knowledge about individual words and concepts. A research frontier is to develop embeddings for phrases and for relations between words and facts. Search engines already use machine learning for this purpose but much more remains to be done to improve these more advanced representations.

12.5.2.1 Knowledge, Relations and Question Answering

One interesting research direction is determining how distributed representations can be trained to capture the relations between two entities. These relations allow us to formalize facts about objects and how objects interact with each other.

In mathematics, a binary relation is a set of ordered pairs of objects. Pairs
that are in the set are said to have the relation while those that are not in the set do not. For example, we can define the relation "is less than" on the set of entities {1, 2, 3} by defining the set of ordered pairs S = {(1, 2), (1, 3), (2, 3)}. Once this relation is defined, we can use it like a verb. Because (1, 2) ∈ S, we say that 1 is less than 2. Because (2, 1) ∉ S, we cannot say that 2 is less than 1. Of course, the entities that are related to one another need not be numbers. We could define a relation is_a_type_of containing tuples like (dog, mammal).

In the context of AI, we think of a relation as a sentence in a syntactically
simple and highly structured language. The relation plays the role of a verb, while two arguments to the relation play the role of its subject and object. These sentences take the form of a triplet of tokens

(subject, verb, object)    (12.22)

with values

(entity_i, relation_j, entity_k).    (12.23)

We can also define an attribute, a concept analogous to a relation, but taking only one argument:

(entity_i, attribute_j).    (12.24)

For example, we could define the has_fur attribute, and apply it to entities like dog.

Many applications require representing relations and reasoning about them. How should we best do this within the context of neural networks?

Machine learning models of course require training data. We can infer relations between entities from training datasets consisting of unstructured natural language. There are also structured databases that identify relations explicitly. A common structure for these databases is the relational database, which stores this same kind of information, albeit not formatted as three token sentences. When a
When a exp expert ertofkno knowledge wledge ab about out an application areaastothree an artificial in intelligence telligence system, database is in tended to convey commonsense knowledge about everyda life or w e call the database a know knowle le ledge dge base ase.. Knowledge bases range from ygeneral expert kno wledge ab out an application area to an artificial intelligence system, Freebase Freebase, OpenCyc, Wikibase Wikibase, ones like , OpenCyc , WordNet, or ,1 etc. to more sp specialized ecialized w e call the database a know le dge b ase . Knowledge bases range from general GeneOntology.. 2 Represen kno knowledge wledge bases, like GeneOntology Representations tations for entities and relations Freebase , OpenCyc, WordNet, or Wikibase, etc. to more specialized ones like can be learned by considering eac triplet in a kno each h knowledge wledge base as a training example GeneOntology kno wledge bases, like . Represen tations and relations and maximizing a training ob objectiv jectiv jectivee that captures their for jointentities distribution (Bordes can b e learned by considering eac h triplet in a kno wledge base as a training example et al. al.,, 2013a). and maximizing a training ob jective that captures their joint distribution (Bordes In addition to training data, we also need to define a mo model del family to train. et al., 2013a). A common approach is to extend neural language mo models dels to mo model del en entities tities and In addition to training data, we also need to define a mo del family to train. relations. Neu Neural ral language mo models dels learn a vector that provides a distributed A common approach extend language dels to model eneen tities and represen representation tation of eac each hiswto ord. Theyneural also learn ab about outmo interactions betw etween words, relations. 
Neu ral language mo dels learn a vector that provides a distributed suc such h as which word is likely to come after a sequence of words, by learning functions represen tation ofWeac h wextend ord. They learn to aben out interactions betwby eenlearning words, of these vectors. e can this also approach entities tities and relations suchem asbwhich ord is likely to hcome after aIn sequence of w ords, byblearning functions an emb eddingwv ector for eac each relation. fact, the parallel etw etween een mo modeling deling of these vectors. We can extend this approach to entities and relations by learning 1 cyc.com/opencyc, wordnet. from web sites:Infreebase.com, an Respectively embedding available vector for eacthese h relation. fact, the parallel between mo deling princeton.edu, wikiba.se 2 geneontology.org
language and modeling knowledge encoded as relations is so close that researchers have trained representations of such entities by using both knowledge bases and natural language sentences (Bordes et al., 2011, 2012; Wang et al., 2014a) or combining data from multiple relational databases (Bordes et al., 2013b). Many possibilities exist for the particular parametrization associated with such a model. Early work on learning about relations between entities (Paccanaro and Hinton, 2000) posited highly constrained parametric forms ("linear relational embeddings"), often using a different form of representation for the relation than for the entities. For example, Paccanaro and Hinton (2000) and Bordes et al. (2011) used vectors for entities and matrices for relations, with the idea that a relation acts like an operator on entities. Alternatively, relations can be considered as any other entity (Bordes
al.,, 2012 allowingfor us relations, to make statemen statements ab about out flexibility y is on en tities. Alternatively , relations can b e considered as any other en tit y ( Bordes put in the machinery that combines them in order to mo model del their joint distribution. et al., 2012), allowing us to make statements about relations, but more flexibility is A practical short-term application of suc such h mo models dels is link pr preediction diction:: predicting put in the machinery that combines them in order to model their joint distribution. missing arcs in the kno knowledge wledge graph. This is a form of generalization to new A practical short-term application such models is link ediction: exist predicting facts, based on old facts. Most of the of knowledge bases thatprcurrently hav havee missing arcs in the kno wledge graph. This is a form of generalization to new been constructed through manual lab labor, or, whic which h tends to leav leavee many and probably facts, based Mostabsen of the knowledge bases that currently hav e the ma majorit jorit jority yon of old truefacts. relations absent t from the kno knowledge wledge base. See Wexist ang et al. een constructed manual labor, whiceth al. tends to )leav many andofprobably (b2014b ), Lin et al.through (2015) and Garcia-Duran (2015 foreexamples suc such h an the ma jorit y of true relations absen t from the kno wledge base. See W ang et al. application. (2014b), Lin et al. (2015) and Garcia-Duran et al. (2015) for examples of such an Ev Evaluating aluating the performance of a mo model del on a link prediction task is difficult application. because we hav havee only a dataset of positiv ositivee examples (facts that are kno known wn to Ev aluating the p erformance of a mo del on a link prediction task is difficult be true). 
If the model proposes a fact that is not in the dataset, we are unsure whether the model has made a mistake or discovered a new, previously unknown fact. The metrics are thus somewhat imprecise and are based on testing how the model ranks a held-out set of known true positive facts compared to other facts that are less likely to be true. A common way to construct interesting examples that are probably negative (facts that are probably false) is to begin with a true fact and create corrupted versions of that fact, for example by replacing one entity in the relation with a different entity selected at random. The popular precision at 10% metric counts how many times the model ranks a "correct" fact among the top 10% of all corrupted versions of that fact.

Another application of knowledge bases and distributed representations for them is word-sense disambiguation (Navigli and Velardi, 2005; Bordes et
al.,, 2012), Another application of knowledge bases and distributed represen tations for whic which h is the task of deciding whic which h of the senses of a word is the appropriate one, them is wor d-sense disambiguation (Navigli and Velardi, 2005; Bordes et al., 2012), in some context. which is the task of deciding which of the senses of a word is the appropriate one, Ev Even en entually tually tually,, knowledge of relations com combined bined with a reasoning process and in some context. understanding of natural language could allo allow w us to build a general question Eventually, knowledge of relations combined with a reasoning process and understanding of natural language could 487 allow us to build a general question
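The corrupted-fact ranking evaluation described above can be sketched in a few lines. Everything below is an illustrative assumption rather than the chapter's model: the entity and relation embeddings are random, the TransE-style distance score is just one possible plausibility function, and a "true" fact is planted so that it should rank well among its corruptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, dim = 50, 8

# Hypothetical learned embeddings (random here, for illustration only)
E = rng.normal(size=(n_entities, dim))   # entity embeddings
r = rng.normal(size=dim)                 # one relation embedding

def score(head, tail):
    # TransE-style plausibility: higher (less negative) is more plausible
    return -np.linalg.norm(E[head] + r - E[tail])

# Plant a true fact (head=0, tail=1) that fits the relation exactly
E[1] = E[0] + r
true_head, true_tail = 0, 1

# Corrupt the fact by replacing the tail with every other entity
corrupted_tails = [t for t in range(n_entities) if t != true_tail]
scores = np.array([score(true_head, t) for t in corrupted_tails])
true_score = score(true_head, true_tail)

# Rank of the true fact among all corrupted versions (1 = best),
# and whether it lands in the top 10%
rank = 1 + np.sum(scores > true_score)
in_top_10pct = rank <= 0.1 * (1 + len(corrupted_tails))
print(rank, in_top_10pct)
```

Because the true fact is planted to fit the relation exactly, it should rank first among its corruptions; for a learned model one would average this top-10% indicator over many held-out facts.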
CHAPTER 12. APPLICATIONS
answering system. A general question answering system must be able to process input information and remember important facts, organized in a way that enables it to retrieve and reason about them later. This remains a difficult open problem which can only be solved in restricted “toy” environments. Currently, the best approach to remembering and retrieving specific declarative facts is to use an explicit memory mechanism, as described in Sec. 10.12. Memory networks were first proposed to solve a toy question answering task (Weston et al., 2014). Kumar et al. (2015) have proposed an extension that uses GRU recurrent nets to read the input into the memory and to produce the answer given the contents of the memory.

Deep learning has been applied to many other applications besides the ones described here, and will surely be applied to even more after this writing. It would be impossible to describe anything remotely resembling a comprehensive coverage of such a topic. This survey provides a representative sample of what is possible as of this writing.

This concludes Part II, which has described modern practices involving deep networks, comprising all of the most successful methods. Generally speaking, these methods involve using the gradient of a cost function to find the parameters of a model that approximates some desired function. With enough training data, this approach is extremely powerful. We now turn to Part III, in which we step into the territory of research: methods that are designed to work with less training data or to perform a greater variety of tasks, where the challenges are more difficult and not as close to being solved as the situations we have described so far.
Part III
Deep Learning Research
This part of the book describes the more ambitious and advanced approaches to deep learning, currently pursued by the research community.

In the previous parts of the book, we have shown how to solve supervised learning problems: how to learn to map one vector to another, given enough examples of the mapping.

Not all problems we might want to solve fall into this category. We may wish to generate new examples, or determine how likely some point is, or handle missing values and take advantage of a large set of unlabeled examples or examples from related tasks. A shortcoming of the current state of the art for industrial applications is that our learning algorithms require large amounts of supervised data to achieve good accuracy. In this part of the book, we discuss some of the speculative approaches to reducing the amount of labeled data necessary for existing models to work well and be applicable across a broader range of tasks. Accomplishing these goals usually requires some form of unsupervised or semi-supervised learning.

Many deep learning algorithms have been designed to tackle unsupervised learning problems, but none have truly solved the problem in the same way that deep learning has largely solved the supervised learning problem for a wide variety of tasks. In this part of the book, we describe the existing approaches to unsupervised learning and some of the popular thought about how we can make progress in this field.

A central cause of the difficulties with unsupervised learning is the high dimensionality of the random variables being modeled. This brings two distinct challenges: a statistical challenge and a computational challenge. The statistical challenge regards generalization: the number of configurations we may want to distinguish can grow exponentially with the number of dimensions of interest, and this quickly becomes much larger than the number of examples one can possibly have (or use with bounded computational resources). The computational challenge associated with high-dimensional distributions arises because many algorithms for learning or using a trained model (especially those based on estimating an explicit probability function) involve intractable computations that grow exponentially with the number of dimensions.

With probabilistic models, this computational challenge arises from the need to perform intractable inference or simply from the need to normalize the distribution.

• Intractable inference: inference is discussed mostly in Chapter 19. It regards the question of guessing the probable values of some variables a, given other variables b, with respect to a model that captures the joint distribution between a, b and c. In order to even compute such conditional probabilities one needs to sum over the values of the variables c, as well as compute a normalization constant which sums over the values of a and c.

• Intractable normalization constants (the partition function): the partition function is discussed mostly in Chapter 18. Normalizing constants of probability functions come up in inference (above) as well as in learning. Many probabilistic models involve such a normalizing constant. Unfortunately, learning such a model often requires computing the gradient of the logarithm of the partition function with respect to the model parameters. That computation is generally as intractable as computing the partition function itself. Monte Carlo Markov chain (MCMC) methods (Chapter 17) are often used to deal with the partition function (computing it or its gradient). Unfortunately, MCMC methods suffer when the modes of the model distribution are numerous and well-separated, especially in high-dimensional spaces (Sec. 17.5).

One way to confront these intractable computations is to approximate them, and many approaches have been proposed as discussed in this third part of the book. Another interesting way, also discussed here, would be to avoid these intractable computations altogether by design, and methods that do not require such computations are thus very appealing. Several generative models have been proposed in recent years, with that motivation. A wide variety of contemporary approaches to generative modeling are discussed in Chapter 20.

Part III is the most important for a researcher, someone who wants to understand the breadth of perspectives that have been brought to the field of deep learning, and push the field forward towards true artificial intelligence.
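As a concrete toy illustration of the inference challenge above: even for binary variables, computing p(a | b) by brute force requires summing the joint table over every configuration of the nuisance variables c, and the table itself already has 2^(2+n) entries. The joint distribution below is a random, hypothetical one, used only to show where the exponential sums appear.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete joint over binary a, b, and n_c nuisance variables c.
# Storing the table costs 2**(2 + n_c) entries -- already exponential in n_c.
n_c = 10
joint = rng.random(size=(2, 2) + (2,) * n_c)
joint /= joint.sum()                    # normalize so it is a distribution

def conditional_a_given_b(b):
    # p(a | b) needs a sum over all 2**n_c configurations of c for each
    # value of a, plus a normalization over a -- both scale with 2**n_c.
    p_ab = joint[:, b].reshape(2, -1).sum(axis=1)
    return p_ab / p_ab.sum()

p = conditional_a_given_b(1)
print(p, "terms summed per value of a:", 2 ** n_c)
```

With only 10 nuisance variables each conditional costs about a thousand additions; real models have hundreds or thousands of variables, which is why approximate inference is needed.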
Chapter 13
Linear Factor Models

Many of the research frontiers in deep learning involve building a probabilistic model of the input, p_model(x). Such a model can, in principle, use probabilistic inference to predict any of the variables in its environment given any of the other variables. Many of these models also have latent variables h, with p_model(x) = E_h p_model(x | h). These latent variables provide another means of representing the data. Distributed representations based on latent variables can obtain all of the advantages of representation learning that we have seen with deep feedforward and recurrent networks.

In this chapter, we describe some of the simplest probabilistic models with latent variables: linear factor models. These models are sometimes used as building blocks of mixture models (Hinton et al., 1995a; Ghahramani and Hinton, 1996; Roweis et al., 2002) or larger, deep probabilistic models (Tang et al., 2012). They also show many of the basic approaches necessary to build generative models that the more advanced deep models will extend further.

A linear factor model is defined by the use of a stochastic, linear decoder function that generates x by adding noise to a linear transformation of h.

These models are interesting because they allow us to discover explanatory factors that have a simple joint distribution. The simplicity of using a linear decoder made these models some of the first latent variable models to be extensively studied.

A linear factor model describes the data generation process as follows. First, we sample the explanatory factors h from a distribution

h ∼ p(h),    (13.1)

where p(h) is a factorial distribution, with p(h) = ∏_i p(h_i), so that it is easy to
CHAPTER 13. LINEAR FACTOR MODELS
sample from. Next we sample the real-valued observable variables given the factors:

x = W h + b + noise,    (13.2)

where the noise is typically Gaussian and diagonal (independent across dimensions). This is illustrated in Fig. 13.1.
Figure 13.1: The directed graphical model describing the linear factor model family, in which we assume that an observed data vector x is obtained by a linear combination of independent latent factors h, plus some noise. Different models, such as probabilistic PCA, factor analysis or ICA, make different choices about the form of the noise and of the prior p(h).
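The two-step ancestral sampling process of Eqs. 13.1 and 13.2 can be sketched directly. The particular choices below (a unit Gaussian factorial prior, random weights, isotropic noise) are arbitrary illustrative assumptions, not a specific model from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_factors, n_obs = 3, 5

# Illustrative, arbitrary parameters: weights W, offset b, noise scale
W = rng.normal(size=(n_obs, n_factors))
b = rng.normal(size=n_obs)
noise_std = 0.1

def sample_x():
    h = rng.normal(size=n_factors)          # h ~ p(h) = prod_i p(h_i)  (13.1)
    noise = noise_std * rng.normal(size=n_obs)
    return W @ h + b + noise                # x = W h + b + noise       (13.2)

x = sample_x()
print(x)
```

Each call first draws the explanatory factors, then decodes them linearly and adds independent noise; averaging many samples recovers the mean b.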
13.1 Probabilistic PCA and Factor Analysis
Probabilistic PCA (principal components analysis), factor analysis and other linear factor models are special cases of the above equations (13.1 and 13.2) and only differ in the choices made for the model's prior over latent variables h before observing x, and in the noise distributions.

In factor analysis (Bartholomew, 1987; Basilevsky, 1994), the latent variable prior is just the unit variance Gaussian

h ∼ N(h; 0, I),    (13.3)

while the observed variables x_i are assumed to be conditionally independent, given h. Specifically, the noise is assumed to be drawn from a diagonal covariance Gaussian distribution, with covariance matrix ψ = diag(σ²), with σ² = [σ²_1, σ²_2, ..., σ²_n]^T a vector of per-variable variances.

The role of the latent variables is thus to capture the dependencies between the different observed variables x_i. Indeed, it can easily be shown that x is just a multivariate normal random variable, with

x ∼ N(x; b, W W^T + ψ).    (13.4)
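Eq. 13.4 can be checked with a quick Monte Carlo sketch (the particular W, b and per-variable noise variances below are arbitrary): the sample covariance of ancestrally generated data should approach W W^T + ψ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500_000, 4, 2

W = rng.normal(size=(d, k))
b = rng.normal(size=d)
sigma2 = np.array([0.1, 0.2, 0.3, 0.4])   # per-variable noise variances

# Ancestral sampling: h ~ N(0, I), then x = W h + b + diagonal Gaussian noise
H = rng.normal(size=(n, k))
X = H @ W.T + b + rng.normal(size=(n, d)) * np.sqrt(sigma2)

emp_cov = np.cov(X, rowvar=False)
pred_cov = W @ W.T + np.diag(sigma2)      # Eq. 13.4: Cov[x] = W W^T + psi
print(np.max(np.abs(emp_cov - pred_cov)))
```

The maximum entry-wise discrepancy shrinks as the number of samples grows, confirming that the latent variables account for all off-diagonal covariance.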
In order to cast PCA in a probabilistic framework, we can make a slight modification to the factor analysis model, making the conditional variances σ²_i equal to each other. In that case the covariance of x is just W W^T + σ²I, where σ² is now a scalar. This yields the conditional distribution

x ∼ N(x; b, W W^T + σ²I),    (13.5)

or equivalently

x = W h + b + σz,    (13.6)

where z ∼ N(z; 0, I) is Gaussian noise. Tipping and Bishop (1999) then show an iterative EM algorithm for estimating the parameters W and σ².

This probabilistic PCA model takes advantage of the observation that most variations in the data can be captured by the latent variables h, up to some small residual reconstruction error σ². As shown by Tipping and Bishop (1999), probabilistic PCA becomes PCA as σ → 0. In that case, the conditional expected value of h given x becomes an orthogonal projection of x − b onto the space spanned by the d columns of W, like in PCA.

As σ → 0, the density model defined by probabilistic PCA becomes very sharp around these d dimensions spanned by the columns of W. This can make the model assign very low likelihood to the data if the data does not actually cluster near a hyperplane.
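The limiting behavior described above can be verified numerically. The posterior mean E[h | x] = (W^T W + σ²I)^{-1} W^T (x − b) is the standard probabilistic PCA result from Tipping and Bishop (1999), stated but not derived in this chapter; as σ² shrinks, mapping it back through W approaches the orthogonal projection of x − b onto the column space of W. The matrices below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2
W = rng.normal(size=(d, k))
b = rng.normal(size=d)
x = rng.normal(size=d)

def reconstruct(sigma2):
    # Posterior mean of h given x, mapped back to data space
    M = W.T @ W + sigma2 * np.eye(k)
    h_mean = np.linalg.solve(M, W.T @ (x - b))
    return W @ h_mean + b

# Orthogonal projection of x - b onto the column space of W, plus b
P = W @ np.linalg.solve(W.T @ W, W.T)
proj = P @ (x - b) + b

for sigma2 in [1.0, 1e-2, 1e-6]:
    print(sigma2, np.linalg.norm(reconstruct(sigma2) - proj))
```

The gap between the posterior-mean reconstruction and the PCA projection shrinks toward zero as σ² → 0, while a large σ² shrinks the reconstruction toward b.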
13.2 Independent Component Analysis (ICA)
Independent component analysis (ICA) is among the oldest representation learning algorithms (Herault and Ans, 1984; Jutten and Herault, 1991; Comon, 1994; Hyvärinen, 1999; Hyvärinen et al., 2001a; Hinton et al., 2001; Teh et al., 2003). It is an approach to modeling linear factors that seeks to separate an observed signal into many underlying signals that are scaled and added together to form the observed data. These signals are intended to be fully independent, rather than merely decorrelated from each other.¹

¹See Sec. 3.8 for a discussion of the difference between uncorrelated variables and independent variables.

Many different specific methodologies are referred to as ICA. The variant that is most similar to the other generative models we have described here is a variant (Pham et al., 1992) that trains a fully parametric generative model. The prior distribution over the underlying factors, p(h), must be fixed ahead of time by the user. The model then deterministically generates x = W h. We can then perform
a nonlinear change of variables (using Eq. 3.47) to determine p (x ). Learning the mo model del then pro proceeds ceeds as usual, using maximum likelihoo likelihood. d. a nonlinear change of variables (using Eq. 3.47) to determine p (x ). Learning the The motiv motivation ation for this approach is that by cho hoosing osing ) to be indep independen enden endent, t, model then proceeds as usual, using maximum likelihood.p (h we can reco recover ver underlying factors that are as close as possible to independent. The motiv ationused, for this is that by choabstract osing p (hcausal ) to befactors, independen t, This is commonly notapproach to capture high-level but to w e can verelunderlying factors as close as possible to setting, independent. reco recov ver reco lo low-lev w-lev w-level signals that hav havee that beenare mixed together. In this each This is commonly used, not to capture high-level abstract causal factors, but to training example is one moment in time, each x i is one sensor’s observ observation ation of recomixed ver low-lev el signals that e been mixedof together. this setting, h i hav the signals, and each is one estimate one of the In original signals.each For x training example is one moment in time, each is one sensor’s observ ation oft example, we migh mightt hav havee n people sp speaking eaking simulta simultaneously neously neously.. If we hav havee n differen different the mixed signals, each h lo iscations, one estimate of detect one ofthe thechanges originalinsignals. For microphones placedand in different locations, ICA can the volume example, we migh t havas e nheard people eaking simultaneously If we hav e nsignals differensot b et etw ween each sp speaker eaker byspeach microphone, and .separate the microphones in different ICA can clearly detect .the changes in the volume that each hi placed con contains tains only onelopcations, erson sp speaking eaking clearly. 
between each speaker as heard by each microphone, and separate the signals so that each h contains only one person speaking clearly. This is commonly used in neuroscience for electroencephalography, a technology for recording electrical signals originating in the brain. Many electrode sensors placed on the subject's head are used to measure many electrical signals coming from the body. The experimenter is typically only interested in signals from the brain, but signals from the subject's heart and eyes are strong enough to confound measurements taken at the subject's scalp. The signals arrive at the electrodes mixed together, so ICA is necessary to separate the electrical signature of the heart from the signals originating in the brain, and to separate signals in different brain regions from each other.

As mentioned before, many variants of ICA are possible. Some add some noise in the generation of x rather than using a deterministic decoder. Most do not use the maximum likelihood criterion, but instead aim to make the elements of h = W^-1 x independent from each other. Many criteria that accomplish this goal are possible. Eq. 3.47 requires taking the determinant of W, which can be an expensive and numerically unstable operation. Some variants of ICA avoid this problematic operation by constraining W to be orthonormal.

All variants of ICA require that p(h) be non-Gaussian. This is because if p(h) is an independent prior with Gaussian components, then W is not identifiable: we can obtain the same distribution over p(x) for many values of W. This is very different from other linear factor models like probabilistic PCA and factor analysis, that often require p(h) to be Gaussian in order to make many operations on the model have closed form solutions. In the maximum likelihood approach where the user explicitly specifies the distribution, a typical choice is to use p(h_i) = (d/dh_i) σ(h_i). Typical choices of these non-Gaussian distributions have larger peaks near 0 than does the Gaussian distribution, so we can also see most implementations of ICA as learning sparse features.
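The non-identifiability under a Gaussian prior is easy to check numerically. The NumPy sketch below (an illustration, not from the book) shows that replacing W with WQ for a random orthogonal Q leaves the covariance of x = Wh unchanged, and hence leaves the entire Gaussian distribution p(x) unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# A mixing matrix W and a random orthogonal matrix Q.
W = rng.normal(size=(3, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# With a Gaussian prior p(h) = N(0, I), x = W h is zero-mean Gaussian with
# covariance W W^T, which fully determines p(x). Replacing W by W Q leaves
# this covariance unchanged, so W cannot be identified from the data.
cov_W = W @ W.T
cov_WQ = (W @ Q) @ (W @ Q).T
print(np.allclose(cov_W, cov_WQ))  # True
```

Any orthogonal Q works here, since (WQ)(WQ)^T = W Q Q^T W^T = W W^T; this is exactly why a non-Gaussian p(h) is needed to pin down W.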
CHAPTER 13. LINEAR FACTOR MODELS
Many variants of ICA are not generative models in the sense that we use the phrase. In this book, a generative model either represents p(x) or can draw samples from it. Many variants of ICA only know how to transform between x and h, but do not have any way of representing p(h), and thus do not impose a distribution over p(x). For example, many ICA variants aim to increase the sample kurtosis of h = W^-1 x, because high kurtosis indicates that p(h) is non-Gaussian, but this is accomplished without explicitly representing p(h). This is because ICA is more often used as an analysis tool for separating signals, rather than for generating data or estimating its density.

Just as PCA can be generalized to the nonlinear autoencoders described in Chapter 14, ICA can be generalized to a nonlinear generative model, in which we use a nonlinear function f to generate the observed data. See Hyvärinen and Pajunen (1999) for the initial work on nonlinear ICA and its successful use with ensemble learning by Roberts and Everson (2001) and Lappalainen et al. (2000). Another nonlinear extension of ICA is the approach of nonlinear independent components estimation, or NICE (Dinh et al., 2014), which stacks a series of invertible transformations (encoder stages) that have the property that the determinant of the Jacobian of each transformation can be computed efficiently. This makes it possible to compute the likelihood exactly and, like ICA, NICE attempts to transform the data into a space where it has a factorized marginal distribution, but is more likely to succeed thanks to the nonlinear encoder. Because the encoder is associated with a decoder that is its perfect inverse, it is straightforward to generate samples from the model (by first sampling from p(h) and then applying the decoder).

Another generalization of ICA is to learn groups of features, with statistical dependence allowed within a group but discouraged between groups (Hyvärinen and Hoyer, 1999; Hyvärinen et al., 2001b). When the groups of related units are chosen to be non-overlapping, this is called independent subspace analysis. It is also possible to assign spatial coordinates to each hidden unit and form overlapping groups of spatially neighboring units. This encourages nearby units to learn similar features. When applied to natural images, this topographic ICA approach learns Gabor filters, such that neighboring features have similar orientation, location or frequency. Many different phase offsets of similar Gabor functions occur within each region, so that pooling over small regions yields translation invariance.
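The structural trick behind NICE can be illustrated with a single additive coupling stage. The sketch below is a simplified illustration (not the full architecture of Dinh et al.): half of the input passes through unchanged, and the other half is shifted by an arbitrary function of the first half. The Jacobian is triangular with unit diagonal, so its determinant is exactly 1, and the inverse is obtained by subtracting the same shift:

```python
import numpy as np

def coupling_forward(x, shift_fn):
    """One NICE-style additive coupling stage (sketch): split x into two
    halves, leave the first half unchanged, and shift the second half by a
    function of the first. The Jacobian is triangular with unit diagonal,
    so its determinant is exactly 1 -- trivially cheap to compute."""
    x1, x2 = np.split(x, 2)
    return np.concatenate([x1, x2 + shift_fn(x1)])

def coupling_inverse(y, shift_fn):
    """Exact inverse of the coupling stage: subtract the same shift."""
    y1, y2 = np.split(y, 2)
    return np.concatenate([y1, y2 - shift_fn(y1)])

shift = lambda a: np.tanh(3.0 * a)  # any function works; it need not be invertible
x = np.array([0.5, -1.2, 2.0, 0.3])
y = coupling_forward(x, shift)
print(np.allclose(coupling_inverse(y, shift), x))  # True: the decoder is the exact inverse
```

Note that the shift function itself never needs to be inverted, which is what lets these stages use arbitrarily complex nonlinearities while keeping both the likelihood and sampling tractable.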
13.3 Slow Feature Analysis
Slow feature analysis (SFA) is a linear factor model that uses information from
time signals to learn invariant features (Wiskott and Sejnowski, 2002).

Slow feature analysis is motivated by a general principle called the slowness principle. The idea is that the important characteristics of scenes change very slowly compared to the individual measurements that make up a description of a scene. For example, in computer vision, individual pixel values can change very rapidly. If a zebra moves from left to right across the image, an individual pixel will rapidly change from black to white and back again as the zebra's stripes pass over the pixel. By comparison, the feature indicating whether a zebra is in the image will not change at all, and the feature describing the zebra's position will change slowly. We therefore may wish to regularize our model to learn features that change slowly over time.

The slowness principle predates slow feature analysis and has been applied to a wide variety of models (Hinton, 1989; Földiák, 1989; Mobahi et al., 2009; Bergstra and Bengio, 2009). In general, we can apply the slowness principle to any differentiable model trained with gradient descent. The slowness principle may be introduced by adding a term to the cost function of the form

    λ Σ_t L(f(x^(t+1)), f(x^(t)))     (13.7)

where λ is a hyperparameter determining the strength of the slowness regularization term, t is the index into a time sequence of examples, f is the feature extractor to be regularized, and L is a loss function measuring the distance between f(x^(t)) and f(x^(t+1)). A common choice for L is the mean squared difference.

Slow feature analysis is a particularly efficient application of the slowness principle. It is efficient because it is applied to a linear feature extractor, and can thus be trained in closed form. Like some variants of ICA, SFA is not quite a generative model per se, in the sense that it defines a linear map between input space and feature space but does not define a prior over feature space and thus does not impose a distribution p(x) on input space.

The SFA algorithm (Wiskott and Sejnowski, 2002) consists of defining f(x; θ) to be a linear transformation, and solving the optimization problem

    min_θ E_t [(f(x^(t+1))_i − f(x^(t))_i)^2]     (13.8)

subject to the constraints

    E_t [f(x^(t))_i] = 0     (13.9)

and

    E_t [f(x^(t))_i^2] = 1.     (13.10)
The constraint that the learned feature have zero mean is necessary to make the problem have a unique solution; otherwise we could add a constant to all feature values and obtain a different solution with equal value of the slowness objective. The constraint that the features have unit variance is necessary to prevent the pathological solution where all features collapse to 0. Like PCA, the SFA features are ordered, with the first feature being the slowest. To learn multiple features, we must also add the constraint

    ∀i < j,  E_t [f(x^(t))_i f(x^(t))_j] = 0.     (13.11)

This specifies that the learned features must be linearly decorrelated from each other. Without this constraint, all of the learned features would simply capture the one slowest signal. One could imagine using other mechanisms, such as minimizing reconstruction error, to force the features to diversify, but this decorrelation mechanism admits a simple solution due to the linearity of SFA features.
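A minimal NumPy sketch of one way to carry this out (an illustration under simplifying assumptions, not the exact published algorithm): whitening the data enforces the zero-mean, unit-variance and decorrelation constraints of Eqs. 13.9-13.11, after which minimizing the slowness objective of Eq. 13.8 reduces to an eigendecomposition of the covariance of the temporal differences:

```python
import numpy as np

def linear_sfa(X, n_features=1):
    """Minimal linear SFA sketch. X has shape (T, d) and is assumed to have
    a full-rank covariance. Whitening makes the outputs zero-mean, unit-
    variance and decorrelated; the eigenvectors of the difference covariance
    with the smallest eigenvalues then give the slowest features."""
    X = X - X.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(X.T))
    Z = X @ E @ np.diag(1.0 / np.sqrt(d))   # whitened signal, identity covariance
    dz = np.diff(Z, axis=0)                 # temporal differences
    _, W = np.linalg.eigh(np.cov(dz.T))     # ascending: slowest directions first
    return Z @ W[:, :n_features]

# Toy data: a slow sine mixed with fast noise; SFA recovers the slow source.
t = np.linspace(0, 4 * np.pi, 500)
slow = np.sin(t)
X = np.column_stack([
    slow + 0.1 * np.random.default_rng(0).normal(size=t.size),
    np.random.default_rng(1).normal(size=t.size),
])
y = linear_sfa(X, n_features=1)[:, 0]
print(abs(np.corrcoef(y, slow)[0, 1]))  # close to 1: the slow source is recovered
```

The sign of each recovered feature is arbitrary (eigenvectors are defined up to sign), which is why the correlation is compared in absolute value.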
The SFA problem may be solved in closed form by a linear algebra package.

SFA is typically used to learn nonlinear features by applying a nonlinear basis expansion to x before running SFA. For example, it is common to replace x by the quadratic basis expansion, a vector containing elements x_i x_j for all i and j. Linear SFA modules may then be composed to learn deep nonlinear slow feature extractors by repeatedly learning a linear SFA feature extractor, applying a nonlinear basis expansion to its output, and then learning another linear SFA feature extractor on top of that expansion.

When trained on small spatial patches of videos of natural scenes, SFA with quadratic basis expansions learns features that share many characteristics with those of complex cells in V1 cortex (Berkes and Wiskott, 2005). When trained on videos of random motion within 3-D computer rendered environments, deep SFA learns features that share many characteristics with the features represented by neurons in rat brains that are used for navigation (Franzius et al., 2007). SFA thus seems to be a reasonably biologically plausible model.

A major advantage of SFA is that it is possible to theoretically predict which features SFA will learn, even in the deep, nonlinear setting. To make such theoretical predictions, one must know about the dynamics of the environment in terms of configuration space (e.g., in the case of random motion in the 3-D rendered environment, the theoretical analysis proceeds from knowledge of the probability distribution over position and velocity of the camera). Given the knowledge of how the underlying factors actually change, it is possible to analytically solve for the optimal functions expressing these factors. In practice, experiments with deep SFA applied to simulated data seem to recover the theoretically predicted functions.
This is in comparison to other learning algorithms where the cost function depends highly on specific pixel values, making it much more difficult to determine what features the model will learn.

Deep SFA has also been used to learn features for object recognition and pose estimation (Franzius et al., 2008). So far, the slowness principle has not become the basis for any state of the art applications. It is unclear what factor has limited its performance. We speculate that perhaps the slowness prior is too strong, and that, rather than imposing a prior that features should be approximately constant, it would be better to impose a prior that features should be easy to predict from one time step to the next. The position of an object is a useful feature regardless of whether the object's velocity is high or low, but the slowness principle encourages the model to ignore the position of objects that have high velocity.
13.4 Sparse Coding
Sparse coding (Olshausen and Field, 1996) is a linear factor model that has been heavily studied as an unsupervised feature learning and feature extraction mechanism. Strictly speaking, the term "sparse coding" refers to the process of inferring the value of h in this model, while "sparse modeling" refers to the process of designing and learning the model, but the term "sparse coding" is often used to refer to both.

Like most other linear factor models, it uses a linear decoder plus noise to obtain reconstructions of x, as specified in Eq. 13.2. More specifically, sparse coding models typically assume that the linear factors have Gaussian noise with isotropic precision β:

    p(x | h) = N(x; Wh + b, (1/β) I).     (13.12)

The distribution p(h) is chosen to be one with sharp peaks near 0 (Olshausen and Field, 1996). Common choices include factorized Laplace, Cauchy or factorized Student-t distributions. For example, the Laplace prior parametrized in terms of the sparsity penalty coefficient λ is given by

    p(h_i) = Laplace(h_i; 0, 2/λ) = (λ/4) e^(−(1/2) λ |h_i|)     (13.13)

and the Student-t prior by

    p(h_i) ∝ 1 / (1 + h_i^2 / ν)^((ν+1)/2).     (13.14)
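As a quick numerical sanity check (an illustration, not from the book), the Laplace prior of Eq. 13.13 is a properly normalized density, and it is more sharply peaked at 0 than a Gaussian of the same variance, which is what makes it sparsity-inducing:

```python
import numpy as np

lam = 2.0
h, dh = np.linspace(-40, 40, 400_001, retstep=True)
laplace = (lam / 4) * np.exp(-(lam / 2) * np.abs(h))    # the prior of Eq. 13.13
print(round(float((laplace * dh).sum()), 3))            # 1.0: properly normalized
var = float((h**2 * laplace * dh).sum())                # variance of Laplace(0, 2/lam) is 8/lam^2
gauss_peak = 1.0 / np.sqrt(2 * np.pi * var)             # equal-variance Gaussian, density at 0
print(laplace[h.size // 2] > gauss_peak)                # True: sharper peak near zero
```

The same comparison holds for the Student-t prior; heavier tails and a sharper central peak are the qualitative signature of all the priors used in sparse coding.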
Training sparse coding with maximum likelihood is intractable. Instead, the training alternates between encoding the data and training the decoder to better reconstruct the data given the encoding. This approach will be justified further as a principled approximation to maximum likelihood later, in Sec. 19.3.

For models such as PCA, we have seen the use of a parametric encoder function that predicts h and consists only of multiplication by a weight matrix. The encoder that we use with sparse coding is not a parametric encoder. Instead, the encoder is an optimization algorithm that solves an optimization problem in which we seek the single most likely code value:

    h* = f(x) = arg max_h p(h | x).     (13.15)

When combined with Eq. 13.13 and Eq. 13.12, this yields the following optimization problem:

    arg max_h p(h | x)     (13.16)
    = arg max_h log p(h | x)     (13.17)
    = arg min_h λ ||h||_1 + β ||x − Wh||_2^2,     (13.18)

where we have dropped terms not depending on h and divided by positive scaling factors to simplify the equation.

Due to the imposition of an L^1 norm on h, this procedure will yield a sparse h* (see Sec. 7.1.2).

To train the model rather than just perform inference, we alternate between minimization with respect to h and minimization with respect to W. In this presentation, we treat β as a hyperparameter. Typically it is set to 1 because its role in this optimization problem is shared with λ and there is no need for both hyperparameters. In principle, we could also treat β as a parameter of the model and learn it. Our presentation here has discarded some terms that do not depend on h but do depend on β. To learn β, these terms must be included, or β will collapse to 0.

Not all approaches to sparse coding explicitly build a p(h) and a p(x | h).
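The encoding problem of Eq. 13.18 is convex and can be solved by any iterative sparse solver. The NumPy sketch below uses ISTA (proximal gradient descent), one common choice; the book does not prescribe a particular algorithm, and the dictionary and coefficients here are made-up toy values:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, W, lam=0.1, beta=1.0, n_steps=200):
    """Minimize  lam * ||h||_1 + beta * ||x - W h||_2^2  over h (Eq. 13.18)
    with ISTA: a gradient step on the smooth reconstruction term followed
    by soft-thresholding for the L1 term."""
    h = np.zeros(W.shape[1])
    step = 1.0 / (2 * beta * np.linalg.norm(W, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_steps):
        grad = -2 * beta * W.T @ (x - W @ h)
        h = soft_threshold(h - step * grad, step * lam)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 50))            # overcomplete toy dictionary
h_true = np.zeros(50)
h_true[[3, 17, 41]] = [1.5, -2.0, 1.0]   # a sparse ground-truth code
x = W @ h_true
h = sparse_code(x, W)
print(np.count_nonzero(np.abs(h) > 1e-3))  # number of active code elements
```

Because the problem is convex, running the iteration longer only brings h closer to the unique optimum; this is the "encoder as optimization algorithm" described above.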
Often we are just interested in learning a dictionary of features with activation values that will often be zero when extracted using this inference procedure.

If we sample h from a Laplace prior, it is in fact a zero probability event for an element of h to actually be zero. The generative model itself is not especially sparse, only the feature extractor is. Goodfellow et al. (2013d) describe approximate
inference in a different model family, the spike and slab sparse coding model, for which samples from the prior usually contain true zeros.

The sparse coding approach combined with the use of the non-parametric encoder can in principle minimize the combination of reconstruction error and log-prior better than any specific parametric encoder. Another advantage is that there is no generalization error to the encoder. A parametric encoder must learn how to map x to h in a way that generalizes. For unusual x that do not resemble the training data, a learned, parametric encoder may fail to find an h that results in accurate reconstruction or a sparse code. For the vast majority of formulations of sparse coding models, where the inference problem is convex, the optimization procedure will always find the optimal code (unless degenerate cases such as replicated weight vectors occur). Obviously, the sparsity and reconstruction costs can still rise on unfamiliar points, but this is due to generalization error in the decoder weights, rather than generalization error in the encoder. The lack of generalization error in sparse coding's optimization-based encoding process may result in better generalization when sparse coding is used as a feature extractor for a classifier than when a parametric function is used to predict the code. Coates and Ng (2011) demonstrated that sparse coding features generalize better for object recognition tasks than the features of a related model based on a parametric encoder, the linear-sigmoid autoencoder. Inspired by their work, Goodfellow et al. (2013d) showed that a variant of sparse coding generalizes better than other feature extractors in the regime where extremely few labels are available (twenty or fewer labels per class).

The primary disadvantage of the non-parametric encoder is that it requires greater time to compute h given x because the non-parametric approach requires running an iterative algorithm. The parametric autoencoder approach, developed in Chapter 14, uses only a fixed number of layers, often only one. Another disadvantage is that it is not straightforward to back-propagate through the non-parametric encoder, which makes it difficult to pretrain a sparse coding model with an unsupervised criterion and then fine-tune it using a supervised criterion. Modified versions of sparse coding that permit approximate derivatives do exist but are not widely used (Bagnell and Bradley, 2009).

Sparse coding, like other linear factor models, often produces poor samples, as shown in Fig. 13.2. This happens even when the model is able to reconstruct the data well and provide useful features for a classifier. The reason is that each individual feature may be learned well, but the factorial prior on the hidden code results in the model including random subsets of all of the features in each generated sample. This motivates the development of deeper models that can impose a non-
CHAPTER 13. LINEAR FACTOR MODELS
Figure 13.2: Example samples and weights from a spike and slab sparse coding model trained on the MNIST dataset. (Left) The samples from the model do not resemble the training examples. At first glance, one might assume the model is poorly fit. (Right) The weight vectors of the model have learned to represent penstrokes and sometimes complete digits. The model has thus learned useful features. The problem is that the factorial prior over features results in random subsets of features being combined. Few such subsets are appropriate to form a recognizable MNIST digit. This motivates the development of generative models that have more powerful distributions over their latent codes. Figure reproduced with permission from Goodfellow et al. (2013d).
factorial distribution on the deepest code layer, as well as the development of more sophisticated shallow models.
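The optimization-based, non-parametric encoding discussed in this section can be made concrete with a short iterative loop. Below is a minimal ISTA-style sketch in NumPy; the dictionary, penalty weight, step count, and toy data are illustrative assumptions, not values from the text.

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_encode(W, x, lam=0.05, n_steps=1000):
    """Minimize 0.5 * ||x - W h||^2 + lam * ||h||_1 over the code h.

    Unlike a parametric encoder, this runs an optimization loop for
    every input x, which is why inference is comparatively slow."""
    h = np.zeros(W.shape[1])
    step = 1.0 / np.linalg.norm(W, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_steps):
        grad = W.T @ (W @ h - x)             # gradient of the reconstruction term
        h = soft_threshold(h - step * grad, step * lam)
    return h

rng = np.random.default_rng(0)
W = rng.standard_normal((20, 50))            # toy overcomplete dictionary
h_true = np.zeros(50)
h_true[[3, 17]] = [1.0, -2.0]                # a genuinely sparse code
x = W @ h_true
h = ista_encode(W, x)
print("reconstruction error:", np.linalg.norm(x - W @ h))
print("nonzero code entries:", int(np.sum(np.abs(h) > 1e-3)))
```

The loop above also illustrates why back-propagating through such an encoder is awkward: the code is the fixed point of an iteration, not the output of a fixed computational graph.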
13.5
Manifold Interpretation of PCA
Linear factor models including PCA and factor analysis can be interpreted as learning a manifold (Hinton et al., 1997). We can view probabilistic PCA as defining a thin pancake-shaped region of high probability: a Gaussian distribution that is very narrow along some axes, just as a pancake is very flat along its vertical axis, but is elongated along other axes, just as a pancake is wide along its horizontal axes. This is illustrated in Fig. 13.3. PCA can be interpreted as aligning this pancake with a linear manifold in a higher-dimensional space. This interpretation applies not just to traditional PCA but also to any linear autoencoder that learns matrices W and V with the goal of making the reconstruction of x lie as close to x as possible.

Let the encoder be

h = f(x) = W^T (x − µ).    (13.19)
The encoder computes a low-dimensional representation h. With the autoencoder view, we have a decoder computing the reconstruction

x̂ = g(h) = b + V h.    (13.20)
Figure 13.3: Flat Gaussian capturing probability concentration near a low-dimensional manifold. The figure shows the upper half of the "pancake" above the "manifold plane" which goes through its middle. The variance in the direction orthogonal to the manifold is very small (arrow pointing out of plane) and can be considered like "noise," while the other variances are large (arrows in the plane) and correspond to "signal," and a coordinate system for the reduced-dimension data.

The choices of linear encoder and decoder that minimize reconstruction error
E[||x − x̂||²]    (13.21)

correspond to V = W, µ = b = E[x], and the columns of W form an orthonormal basis which spans the same subspace as the principal eigenvectors of the covariance matrix

C = E[(x − µ)(x − µ)^T].    (13.22)

In the case of PCA, the columns of W are these eigenvectors, ordered by the magnitude of the corresponding eigenvalues (which are all real and non-negative).

One can also show that eigenvalue λ_i of C corresponds to the variance of x in the direction of eigenvector v^(i). If x ∈ R^D and h ∈ R^d with d < D, then the
optimal reconstruction error (choosing µ, b, V and W as above) is

min E[||x − x̂||²] = Σ_{i=d+1}^{D} λ_i.    (13.23)

Hence, if the covariance has rank d, the eigenvalues λ_{d+1} to λ_D are 0 and reconstruction error is 0.

Furthermore, one can also show that the above solution can be obtained by maximizing the variances of the elements of h, under orthonormal W, instead of minimizing reconstruction error.

Linear factor models are some of the simplest generative models and some of the simplest models that learn a representation of data. Much as linear classifiers and linear regression models may be extended to deep feedforward networks, these linear factor models may be extended to autoencoder networks and deep probabilistic models that perform the same tasks but with a much more powerful and flexible model family.
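The identities in Eqs. 13.19 through 13.23 are easy to check numerically. Below is a small sketch; the synthetic Gaussian data and the dimensions are illustrative assumptions. The mean reconstruction error of the top-d principal subspace should match the sum of the discarded eigenvalues of the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 5, 2, 10000

# Synthetic data with unequal variances along random directions (illustrative).
A = rng.standard_normal((D, D))
X = rng.standard_normal((n, D)) @ A.T

mu = X.mean(axis=0)
C = np.cov(X.T, bias=True)                    # empirical covariance, cf. Eq. 13.22
eigvals, eigvecs = np.linalg.eigh(C)          # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :d]                            # top-d principal eigenvectors
h = (X - mu) @ W                              # encoder: h = W^T (x - mu), Eq. 13.19
X_hat = mu + h @ W.T                          # decoder with V = W, b = mu, Eq. 13.20

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print("mean reconstruction error:  ", mse)
print("sum of discarded eigenvalues:", eigvals[d:].sum())   # Eq. 13.23
```

With the empirical mean and covariance used on both sides, the two printed quantities agree up to floating-point error.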
Chapter 14
Autoencoders

An autoencoder is a neural network that is trained to attempt to copy its input to its output. Internally, it has a hidden layer h that describes a code used to represent the input. The network may be viewed as consisting of two parts: an encoder function h = f(x) and a decoder that produces a reconstruction r = g(h). This architecture is presented in Fig. 14.1. If an autoencoder succeeds in simply learning to set g(f(x)) = x everywhere, then it is not especially useful. Instead, autoencoders are designed to be unable to learn to copy perfectly. Usually they are restricted in ways that allow them to copy only approximately, and to copy only input that resembles the training data. Because the model is forced to prioritize which aspects of the input should be copied, it often learns useful properties of the data.
which aspects of the input should be copied, it often learns useful properties of the Mo Modern dern auto autoenco enco encoders ders hav havee generalized the idea of an enco encoder der and a dedata. co coder der beyond deterministic functions to sto stochastic chastic mappings pencoder (h | x) and Modern pdecoder (x | hauto ). encoders have generalized the idea of an encoder and a de(h x) and coder beyond deterministic functions to stochastic mappings p The idea of auto autoenco enco encoders ders has b een part of the historical landscape of neural p (x h). | net networks works for decades (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, | of autoencoders has been part of the historical landscape of neural The idea 1994 1994). ). T raditionally raditionally, , auto autoenco enco encoders ders w were ere used for dimensionalit dimensionality y reduction or net works for decades ( LeCun , 1987 ; Bourlard and Kamp , 1988 ; Hinton and Zemel feature learning. Recen Recently tly tly,, theoretical connections betw between een auto autoenco enco encoders ders and, 1994 ). Traditionally , auto ders auto wereenco usedders fortodimensionalit reduction or laten latent t variable models hav haveeenco brought autoenco encoders the forefronty of generative feature learning. Recen tlyChapter , theoretical connections autoenco and mo modeling, deling, as we will see in 20. Auto Autoenco enco encoders ders betw ma may y een be thought ofders as being laten t v ariable models hav e brought auto enco ders to the forefront of generative a special case of feedforward netw networks, orks, and ma may y be trained with all of the same mo deling, as w e will see in Chapter 20 . Auto enco may begradients thought of as being tec techniques, hniques, typically minibatc minibatch h gradient descentders following computed ayspecial case of feedforward orks,feedforw and maard y benetw trained all of ders the same b bac back-propagation. k-propagation. 
Unlike general feedforward networks, autoencoders may also be trained using recirculation (Hinton and McClelland, 1988), a learning algorithm based on comparing the activations of the network on the original input
to the activations on the reconstructed input. Recirculation is regarded as more biologically plausible than back-propagation, but is rarely used for machine learning applications.
Figure 14.1: The general structure of an autoencoder, mapping an input x to an output (called reconstruction) r through an internal representation or code h. The autoencoder has two components: the encoder f (mapping x to h) and the decoder g (mapping h to r).
14.1
Undercomplete Autoencoders
Copying the input to the output may sound useless, but we are typically not interested in the output of the decoder. Instead, we hope that training the autoencoder to perform the input copying task will result in h taking on useful properties.

One way to obtain useful features from the autoencoder is to constrain h to have smaller dimension than x. An autoencoder whose code dimension is less than the input dimension is called undercomplete. Learning an undercomplete representation forces the autoencoder to capture the most salient features of the training data.

The learning process is described simply as minimizing a loss function

L(x, g(f(x)))    (14.1)

where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error.

When the decoder is linear and L is the mean squared error, an undercomplete autoencoder learns to span the same subspace as PCA.
In this case, an autoencoder trained to perform the copying task has learned the principal subspace of the training data as a side-effect.

Autoencoders with nonlinear encoder functions f and nonlinear decoder functions g can thus learn a more powerful nonlinear generalization of PCA. Unfortu-
nately, if the encoder and decoder are allowed too much capacity, the autoencoder can learn to perform the copying task without extracting useful information about the distribution of the data. Theoretically, one could imagine that an autoencoder with a one-dimensional code but a very powerful nonlinear encoder could learn to represent each training example x^(i) with the code i. The decoder could learn to map these integer indices back to the values of specific training examples. This specific scenario does not occur in practice, but it illustrates clearly that an autoencoder trained to perform the copying task can fail to learn anything useful about the dataset if the capacity of the autoencoder is allowed to become too great.
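As a concrete illustration of minimizing Eq. 14.1 with a linear decoder and mean squared error, the sketch below trains a tiny undercomplete linear autoencoder by gradient descent. The sizes, learning rate, step count, and synthetic near-low-rank data are all illustrative assumptions; the point is only that minimizing the reconstruction loss drives the code to capture the dominant subspace of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 500, 8, 3

# Data concentrated near a 3-dimensional linear subspace (illustrative).
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, D))
X += 0.01 * rng.standard_normal((n, D))

W = 0.1 * rng.standard_normal((D, d))   # encoder weights: h = f(x) = W^T x
V = 0.1 * rng.standard_normal((D, d))   # decoder weights: x_hat = g(h) = V h

lr = 0.01
for step in range(5000):
    H = X @ W                            # encode the whole batch
    X_hat = H @ V.T                      # decode
    E = X_hat - X                        # residual of L(x, g(f(x))), Eq. 14.1
    grad_V = E.T @ H / n                 # gradient of the mean squared error
    grad_W = X.T @ (E @ V) / n
    V -= lr * grad_V
    W -= lr * grad_W

mse = np.mean(np.sum((X @ W @ V.T - X) ** 2, axis=1))
print("final reconstruction error:", mse)
```

Because d < D, the trained W and V span (approximately) the principal subspace of the data, consistent with the PCA connection noted above.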
14.2
Regularized Autoencoders
Undercomplete autoencoders, with code dimension less than the input dimension, can learn the most salient features of the data distribution. We have seen that these autoencoders fail to learn anything useful if the encoder and decoder are given too much capacity.

A similar problem occurs if the hidden code is allowed to have dimension equal to the input, and in the overcomplete case in which the hidden code has dimension greater than the input. In these cases, even a linear encoder and linear decoder can learn to copy the input to the output without learning anything useful about the data distribution.

Ideally, one could train any architecture of autoencoder successfully, choosing the code dimension and the capacity of the encoder and decoder based on the complexity of distribution to be modeled. Regularized autoencoders provide the ability to do so.
Rather than limiting the model capacity by keeping the encoder and decoder shallow and the code size small, regularized autoencoders use a loss function that encourages the model to have other properties besides the ability to copy its input to its output. These other properties include sparsity of the representation, smallness of the derivative of the representation, and robustness to noise or to missing inputs. A regularized autoencoder can be nonlinear and overcomplete but still learn something useful about the data distribution even if the model capacity is great enough to learn a trivial identity function.

In addition to the methods described here which are most naturally interpreted
as regularized autoencoders, nearly any generative model with latent variables and equipped with an inference procedure (for computing latent representations given input) may be viewed as a particular form of autoencoder. Two generative modeling approaches that emphasize this connection with autoencoders are the descendants of the Helmholtz machine (Hinton et al., 1995b), such as the variational
autoencoder (Sec. 20.10.3) and the generative stochastic networks (Sec. 20.12). These models naturally learn high-capacity, overcomplete encodings of the input and do not require regularization for these encodings to be useful. Their encodings are naturally useful because the models were trained to approximately maximize the probability of the training data rather than to copy the input to the output.
14.2.1
Sparse Autoencoders
A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty Ω(h) on the code layer h, in addition to the reconstruction error:

L(x, g(f(x))) + Ω(h)    (14.2)

where g(h) is the decoder output and typically we have h = f(x), the encoder output.

Sparse autoencoders are typically used to learn features for another task such as classification. An autoencoder that has been regularized to be sparse must respond to unique statistical features of the dataset it has been trained on, rather than simply acting as an identity function. In this way, training to perform the copying task with a sparsity penalty can yield a model that has learned useful features as a byproduct.
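The criterion in Eq. 14.2 can be sketched as follows, using squared error for L and an L1 penalty for Ω (the absolute value form discussed later in this section). The tied-weight ReLU architecture, the sizes, and the training constants here are illustrative assumptions, not the book's prescription; the sketch only demonstrates that a larger penalty weight leaves fewer code units active.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 200, 6, 12                    # overcomplete code (illustrative sizes)
X = rng.standard_normal((n, D))

def train(lam, steps=3000, lr=0.01, seed=1):
    """Minimize the batch mean of L(x, g(f(x))) + Omega(h), Eq. 14.2, with
    h = relu(W^T x), g(h) = W h (tied weights), Omega(h) = lam * ||h||_1."""
    r = np.random.default_rng(seed)
    W = 0.1 * r.standard_normal((D, d))
    for _ in range(steps):
        H = np.maximum(X @ W, 0.0)      # encoder outputs for the whole batch
        E = H @ W.T - X                 # reconstruction residuals
        dH = (E @ W + lam * np.sign(H)) * (H > 0)   # backprop through relu
        W -= lr * (E.T @ H + X.T @ dH) / n
    H = np.maximum(X @ W, 0.0)
    return float(np.mean(H > 1e-3))     # fraction of active code units

print("active fraction, lam=0.0:", train(0.0))
print("active fraction, lam=0.5:", train(0.5))
```

With the penalty switched on, units whose activity is not worth their sparsity cost are driven to zero, illustrating how the penalized copying task yields sparse features as a byproduct.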
We can think of the penalty Ω(h) simply as a regularizer term added to a feedforward network whose primary task is to copy the input to the output (unsupervised learning objective) and possibly also perform some supervised task (with a supervised learning objective) that depends on these sparse features. Unlike other regularizers such as weight decay, there is not a straightforward Bayesian interpretation to this regularizer. As described in Sec. 5.6.1, training with weight decay and other regularization penalties can be interpreted as a MAP approximation to Bayesian inference, with the added regularizing penalty corresponding to a prior probability distribution over the model parameters. In this view, regularized maximum likelihood corresponds to maximizing p(θ | x), which is equivalent to maximizing log p(x | θ) + log p(θ).
The log p(x | θ) term is the usual data log-likelihood term and the log p(θ) term, the log-prior over parameters, incorporates the preference over particular values of θ. This view was described in Sec. 5.6. Regularized autoencoders defy such an interpretation because the regularizer depends on the data and is therefore by definition not a prior in the formal sense of the word. We can still think of these regularization terms as implicitly expressing a preference over functions.

Rather than thinking of the sparsity penalty as a regularizer for the copying task, we can think of the entire sparse autoencoder framework as approximating
maximum likelihood training of a generative model that has latent variables. Suppose we have a model with visible variables x and latent variables h, with an explicit joint distribution p_model(x, h) = p_model(h) p_model(x | h). We refer to p_model(h) as the model's prior distribution over the latent variables, representing the model's beliefs prior to seeing x. This is different from the way we have previously used the word "prior," to refer to the distribution p(θ) encoding our beliefs about the model's parameters before we have seen the training data. The log-likelihood can be decomposed as

log p_model(x) = log Σ_h p_model(h, x).    (14.3)
We can think of the autoencoder as approximating this sum with a point estimate for just one highly likely value for h. This is similar to the sparse coding generative model (Sec. 13.4), but with h being the output of the parametric encoder rather than the result of an optimization that infers the most likely h. From this point of view, with this chosen h, we are maximizing

log p_model(h, x) = log p_model(h) + log p_model(x | h).    (14.4)
(h,bxe) sparsity-inducing. p = log p (h) + log (x hthe ). Laplace (14.4) The log p model (hlog ) term can Forpexample, prior, | λ −λ|hiF| or example, the Laplace prior, The log p (h) term can be sparsity-inducing. pmodel(hi ) = e , (14.5) 2 λ p ( h ) = e corresp corresponds onds to an absolute value sparsity p2enalty enalty. . ,Expressing the log-prior(14.5) as an absolute value penalty enalty,, we obtain corresponds to an absolute value sparsity penalty. Expressing the log-prior as an X absolute value penalty, Ω( wehobtain )=λ |h i | (14.6) i
Ω(h) = X λ h (14.6) λ − log pmodel (h) = λ | h | − log = Ω( h ) + const (14.7) i | | 2 i λ log p λh log (h) = = Ω(h) + const (14.7) 2not h. We typically treat λ as a where the constan constant t term dep depends ends only on and λ X | |− − hyp yperparameter erparameter and discard the constan constantt term since it do does es not affect the parameter where the constan t term dep ends only on and not . We induce typically treat .λFas a λ h learning. Other priors suc such h as the Studen Studenttt-tt prior can also sparsity sparsity. rom hyp erparameter discard as theresulting constantfrom termthe since it do not affect parameter X pmodel (h) onthe this poin ointt of viewand of sparsity effect ofes approximate learning. priors such as the the sparsity Student-pt enalty prior can alsoa induce sparsityterm . From maxim maximum umOther likelihoo likelihood d learning, is not regularization at p ( h this p oin t of view of sparsity as resulting from the effect of ) on approximate all. It is just a consequence of the model’s distribution over its laten latentt variables. maxim um likelihoo d learning, the sparsity p enalty is not a regularization This view pro provides vides a different motiv motivation ation for training an auto autoenco enco encoder: der: it isterm a waat y all. It is just a consequence of the model’s distribution o ver its laten t v ariables. of approximately training a generative mo model. del. It also provides a different reason for This view provides a different motivation for training an autoencoder: it is a way of approximately training a generative mo del. It also provides a different reason for 509
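The equivalence in Eq. 14.7 between the Laplace log-prior and the absolute value penalty is easy to check numerically. A minimal sketch (the helper names are ours, not from the text):

```python
import numpy as np

def laplace_neg_log_prior(h, lam):
    """-log p_model(h) for a factorial Laplace prior
    p_model(h_i) = (lam / 2) * exp(-lam * |h_i|)  (Eq. 14.5)."""
    return np.sum(lam * np.abs(h) - np.log(lam / 2.0))

def sparsity_penalty(h, lam):
    """Omega(h) = lam * sum_i |h_i|  (Eq. 14.6)."""
    return lam * np.sum(np.abs(h))

h = np.array([0.5, -1.2, 0.0, 2.0])
lam = 0.1
const = -h.size * np.log(lam / 2.0)  # depends only on lam, not on h (Eq. 14.7)
assert np.isclose(laplace_neg_log_prior(h, lam), sparsity_penalty(h, lam) + const)
```

Because the constant does not involve h, discarding it leaves the gradients with respect to the code (and hence the learned parameters) unchanged.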
CHAPTER 14. AUTOENCODERS
why the features learned by the autoencoder are useful: they describe the latent variables that explain the input.

Early work on sparse autoencoders (Ranzato et al., 2007a, 2008) explored various forms of sparsity and proposed a connection between the sparsity penalty and the log Z term that arises when applying maximum likelihood to an undirected probabilistic model p(x) = (1/Z) p̃(x). The idea is that minimizing log Z prevents a probabilistic model from having high probability everywhere, and imposing sparsity on an autoencoder prevents the autoencoder from having low reconstruction error everywhere. In this case, the connection is on the level of an intuitive understanding of a general mechanism rather than a mathematical correspondence. The interpretation of the sparsity penalty as corresponding to log p_model(h) in a directed model p_model(h) p_model(x | h) is more mathematically straightforward.

One way to achieve actual zeros in h for sparse (and denoising) autoencoders was introduced in Glorot et al. (2011b). The idea is to use rectified linear units to produce the code layer. With a prior that actually pushes the representations to zero (like the absolute value penalty), one can thus indirectly control the average number of zeros in the representation.
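The effect of the rectifier is worth making explicit: any code unit whose pre-activation is negative emits a value that is exactly zero, not merely small. A minimal sketch:

```python
import numpy as np

def relu_code(pre_activation):
    """Rectified linear code layer: negative pre-activations become exact zeros."""
    return np.maximum(0.0, pre_activation)

# Hypothetical pre-activations of a five-unit code layer.
pre = np.array([-1.3, 0.7, -0.2, 2.1, -0.5])
h = relu_code(pre)
# Three of the five code units are exactly 0.0. A penalty that pushes
# pre-activations below zero therefore directly increases the number of zeros.
```

This is why an absolute value penalty combined with rectified linear units yields genuinely sparse codes, rather than codes with many small but nonzero entries.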
14.2.2
Denoising Autoencoders
Rather than adding a penalty Ω to the cost function, we can obtain an autoencoder that learns something useful by changing the reconstruction error term of the cost function.

Traditionally, autoencoders minimize some function

L(x, g(f(x)))    (14.8)

where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the L² norm of their difference. This encourages g ∘ f to learn to be merely an identity function if they have the capacity to do so.

A denoising autoencoder or DAE instead minimizes

L(x, g(f(x̃))),    (14.9)

where x̃ is a copy of x that has been corrupted by some form of noise. Denoising autoencoders must therefore undo this corruption rather than simply copying their input.

Denoising training forces f and g to implicitly learn the structure of p_data(x), as shown by Alain and Bengio (2013) and Bengio et al. (2013c).
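The contrast between Eq. 14.8 and Eq. 14.9 can be made concrete. In the sketch below (additive Gaussian noise is one common corruption choice; the function names are ours), an identity encoder-decoder pair achieves zero plain reconstruction loss but cannot achieve zero denoising loss:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, sigma=0.2):
    """One common corruption process C(x~ | x): additive isotropic Gaussian noise."""
    return x + sigma * rng.standard_normal(x.shape)

def reconstruction_loss(x, f, g):
    """Plain autoencoder loss L(x, g(f(x))) with squared-error L (Eq. 14.8)."""
    return np.sum((g(f(x)) - x) ** 2)

def denoising_loss(x, f, g):
    """DAE loss L(x, g(f(x~))): reconstruct the CLEAN x from a corrupted input (Eq. 14.9)."""
    return np.sum((g(f(corrupt(x))) - x) ** 2)

identity = lambda v: v
x = np.ones(4)
assert reconstruction_loss(x, identity, identity) == 0.0  # identity is optimal here
assert denoising_loss(x, identity, identity) > 0.0        # but not here
```

Because copying the (corrupted) input is no longer optimal, the DAE must exploit statistical structure in the data to map x̃ back toward x.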
Denoising autoencoders thus provide yet another example of how useful properties can emerge as a byproduct of minimizing reconstruction error. They are also an example of how overcomplete, high-capacity models may be used as autoencoders so long as care is taken to prevent them from learning the identity function. Denoising autoencoders are presented in more detail in Sec. 14.5.
14.2.3
Regularizing by Penalizing Derivatives
Another strategy for regularizing an autoencoder is to use a penalty Ω as in sparse autoencoders,

L(x, g(f(x))) + Ω(h, x),    (14.10)

but with a different form of Ω:

Ω(h, x) = λ Σ_i ||∇_x h_i||².    (14.11)

This forces the model to learn a function that does not change much when x changes slightly. Because this penalty is applied only at training examples, it forces the autoencoder to learn features that capture information about the training distribution.

An autoencoder regularized in this way is called a contractive autoencoder or CAE. This approach has theoretical connections to denoising autoencoders, manifold learning and probabilistic modeling. The CAE is described in more detail in Sec. 14.7.
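For a one-layer sigmoid encoder h = sigmoid(Wx + b), the penalty in Eq. 14.11 is λ times the squared Frobenius norm of the encoder Jacobian, which is available in closed form. A minimal sketch, assuming this particular encoder parametrization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, b, x, lam):
    """Omega(h, x) = lam * sum_i ||grad_x h_i||^2 (Eq. 14.11) for h = sigmoid(W x + b).
    Row i of the encoder Jacobian J is grad_x h_i = h_i * (1 - h_i) * W[i, :]."""
    h = sigmoid(W @ x + b)
    J = (h * (1.0 - h))[:, None] * W  # shape (code_dim, input_dim)
    return lam * np.sum(J ** 2)
```

Note that the penalty differentiates h with respect to the input x, not the parameters; during training the penalty itself must then be differentiated with respect to W and b, which is cheap for one layer but grows costly for deep encoders.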
14.3
Representational Power, Layer Size and Depth
Autoencoders are often trained with only a single layer encoder and a single layer decoder. However, this is not a requirement. In fact, using deep encoders and decoders offers many advantages.

Recall from Sec. 6.4.1 that there are many advantages to depth in a feedforward network. Because autoencoders are feedforward networks, these advantages also apply to autoencoders. Moreover, the encoder is itself a feedforward network as is the decoder, so each of these components of the autoencoder can individually benefit from depth.

One major advantage of non-trivial depth is that the universal approximator theorem guarantees that a feedforward neural network with at least one hidden layer can represent an approximation of any function (within a broad class) to an
arbitrary degree of accuracy, provided that it has enough hidden units. This means that an autoencoder with a single hidden layer is able to represent the identity function along the domain of the data arbitrarily well. However, the mapping from input to code is shallow. This means that we are not able to enforce arbitrary constraints, such as that the code should be sparse. A deep autoencoder, with at least one additional hidden layer inside the encoder itself, can approximate any mapping from input to code arbitrarily well, given enough hidden units.

Depth can exponentially reduce the computational cost of representing some functions. Depth can also exponentially decrease the amount of training data needed to learn some functions. See Sec. 6.4.1 for a review of the advantages of depth in feedforward networks.

Experimentally, deep autoencoders yield much better compression than corresponding shallow or linear autoencoders (Hinton and Salakhutdinov, 2006).

A common strategy for training a deep autoencoder is to greedily pretrain the deep architecture by training a stack of shallow autoencoders, so we often encounter shallow autoencoders, even when the ultimate goal is to train a deep autoencoder.
14.4
Stochastic Encoders and Decoders
Autoencoders are just feedforward networks. The same loss functions and output unit types that can be used for traditional feedforward networks are also used for autoencoders.

As described in Sec. 6.2.2.4, a general strategy for designing the output units and the loss function of a feedforward network is to define an output distribution p(y | x) and minimize the negative log-likelihood −log p(y | x). In that setting, y was a vector of targets, such as class labels.

In the case of an autoencoder, x is now the target as well as the input. However, we can still apply the same machinery as before. Given a hidden code h, we may think of the decoder as providing a conditional distribution p_decoder(x | h). We may then train the autoencoder by minimizing −log p_decoder(x | h). The exact form of this loss function will change depending on the form of p_decoder. As with traditional feedforward networks, we usually use linear output units to parametrize the mean of a Gaussian distribution if x is real-valued. In that case, the negative log-likelihood yields a mean squared error criterion. Similarly, binary x values correspond to a Bernoulli distribution whose parameters are given by a sigmoid output unit, discrete x values correspond to a softmax distribution, and so on.
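These correspondences can be written out directly: for a unit-variance Gaussian the negative log-likelihood reduces to half the squared error plus a constant, and for a factorial Bernoulli it is the familiar cross-entropy. A minimal sketch (the function names are ours):

```python
import numpy as np

def gaussian_nll(x, mean):
    """-log N(x; mean, I), dropping the constant (d/2) * log(2 * pi):
    this reduces to half the squared reconstruction error."""
    return 0.5 * np.sum((x - mean) ** 2)

def bernoulli_nll(x, logits):
    """-log p(x) for a factorial Bernoulli whose parameters are
    produced by a sigmoid output unit applied to the logits."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))
```

Minimizing gaussian_nll over the mean is thus identical (up to scale and an additive constant) to minimizing mean squared reconstruction error, while bernoulli_nll is the usual sigmoid cross-entropy loss for binary data.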
Typically, the output variables are treated as being conditionally independent given h so that this probability distribution is inexpensive to evaluate, but some techniques such as mixture density outputs allow tractable modeling of outputs with correlations.
Figure 14.2: The structure of a stochastic autoencoder, in which both the encoder and the decoder are not simple functions but instead involve some noise injection, meaning that their output can be seen as sampled from a distribution, p_encoder(h | x) for the encoder and p_decoder(x | h) for the decoder.

To make a more radical departure from the feedforward networks we have seen previously, we can also generalize the notion of an encoding function f(x) to an encoding distribution p_encoder(h | x), as illustrated in Fig. 14.2.

Any latent variable model p_model(h, x) defines a stochastic encoder

p_encoder(h | x) = p_model(h | x)    (14.12)

and a stochastic decoder

p_decoder(x | h) = p_model(x | h).    (14.13)

In general, the encoder and decoder distributions are not necessarily conditional distributions compatible with a unique joint distribution p_model(x, h). Alain et al. (2015) showed that training the encoder and decoder as a denoising autoencoder will tend to make them compatible asymptotically (with enough capacity and examples).
14.5
Denoising Autoencoders
The denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output.

The DAE training procedure is illustrated in Fig. 14.3. We introduce a corruption process C(x̃ | x) which represents a conditional distribution over
Figure 14.3: The computational graph of the cost function for a denoising autoencoder, which is trained to reconstruct the clean data point x from its corrupted version x̃. This is accomplished by minimizing the loss L = −log p_decoder(x | h = f(x̃)), where x̃ is a corrupted version of the data example x, obtained through a given corruption process C(x̃ | x). Typically the distribution p_decoder is a factorial distribution whose mean parameters are emitted by a feedforward network g.
corrupted samples x̃, given a data sample x. The autoencoder then learns a reconstruction distribution p_reconstruct(x | x̃) estimated from training pairs (x, x̃), as follows:

1. Sample a training example x from the training data.

2. Sample a corrupted version x̃ from C(x̃ | x = x).

3. Use (x, x̃) as a training example for estimating the autoencoder reconstruction distribution p_reconstruct(x | x̃) = p_decoder(x | h) with h the output of encoder f(x̃) and p_decoder typically defined by a decoder g(h).

Typically we can simply perform gradient-based approximate minimization (such as minibatch gradient descent) on the negative log-likelihood −log p_decoder(x | h). So long as the encoder is deterministic, the denoising autoencoder is a feedforward network and may be trained with exactly the same techniques as any other feedforward network.

We can therefore view the DAE as performing stochastic gradient descent on the following expectation:

−E_{x∼p̂_data(x)} E_{x̃∼C(x̃|x)} log p_decoder(x | h = f(x̃))    (14.14)

where p̂_data(x) is the training distribution.
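The three steps above can be sketched end to end with a deliberately tiny linear autoencoder (the sizes, learning rate, number of epochs and Gaussian corruption level below are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, sigma, lr = 6, 3, 0.1, 0.01          # illustrative sizes and noise level
W_e = 0.1 * rng.standard_normal((k, d))    # encoder f(x) = W_e @ x
W_d = 0.1 * rng.standard_normal((d, k))    # decoder g(h) = W_d @ h
X = rng.standard_normal((100, d))          # stand-in training data

for _ in range(50):
    for x in X:                                       # step 1: draw a clean example
        x_tilde = x + sigma * rng.standard_normal(d)  # step 2: corrupt it via C(x~ | x)
        h = W_e @ x_tilde                             # encode the corrupted input
        err = W_d @ h - x                             # step 3: compare output to the CLEAN x
        # gradient step on the per-example loss 0.5 * ||g(f(x~)) - x||^2
        g_Wd = np.outer(err, h)
        g_We = np.outer(W_d.T @ err, x_tilde)
        W_d -= lr * g_Wd
        W_e -= lr * g_We
```

With minibatch sampling of both x and x̃, this loop is exactly stochastic gradient descent on Eq. 14.14 for a Gaussian p_decoder. After training, the reconstruction error on clean inputs drops well below its initial value, though the undercomplete code (k < d) prevents it from reaching zero.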
Figure 14.4: A denoising autoencoder is trained to map a corrupted data point x̃ back to the original data point x. We illustrate training examples x as red crosses lying near a low-dimensional manifold illustrated with the bold black line. We illustrate the corruption process C(x̃ | x) with a gray circle of equiprobable corruptions. A gray arrow demonstrates how one training example is transformed into one sample from this corruption process. When the denoising autoencoder is trained to minimize the average of squared errors ||g(f(x̃)) − x||², the reconstruction g(f(x̃)) estimates E_{x,x̃∼p_data(x)C(x̃|x)}[x | x̃]. The vector g(f(x̃)) − x̃ points approximately towards the nearest point on the manifold, since g(f(x̃)) estimates the center of mass of the clean points x which could have given rise to x̃. The autoencoder thus learns a vector field g(f(x)) − x indicated by the green arrows. This vector field estimates the score ∇_x log p_data(x) up to a multiplicative factor that is the average root mean square reconstruction error.
14.5.1
Estimating the Score
Score matching (Hyvärinen, 2005) is an alternative to maximum likelihood. It provides a consistent estimator of probability distributions based on encouraging the model to have the same score as the data distribution at every training point x. In this context, the score is a particular gradient field:

∇_x log p(x).    (14.15)

Score matching is discussed further in Sec. 18.4. For the present discussion regarding autoencoders, it is sufficient to understand that learning the gradient field of log p_data is one way to learn the structure of p_data itself.

A very important property of DAEs is that their training criterion (with conditionally Gaussian p(x | h)) makes the autoencoder learn a vector field (g(f(x)) − x) that estimates the score of the data distribution. This is illustrated in Fig. 14.4.

Denoising training of a specific kind of autoencoder (sigmoidal hidden units, linear reconstruction units) using Gaussian noise and mean squared error as the reconstruction cost is equivalent (Vincent, 2011) to training a specific kind of undirected probabilistic model called an RBM with Gaussian visible units. This kind of model will be described in detail in Sec. 20.5.1; for the present discussion it suffices to know that it is a model that provides an explicit p_model(x; θ). When the RBM is trained using denoising score matching (Kingma and LeCun, 2010), its learning algorithm is equivalent to denoising training in the corresponding autoencoder. With a fixed noise level, regularized score matching is not a consistent estimator; it instead recovers a blurred version of the distribution. However, if the noise level is chosen to approach 0 when the number of examples approaches infinity, then consistency is recovered. Denoising score matching is discussed in more detail in Sec. 18.5.

Other connections between autoencoders and RBMs exist. Score matching applied to RBMs yields a cost function that is identical to reconstruction error combined with a regularization term similar to the contractive penalty of the CAE (Swersky et al., 2011a). Bengio and Delalleau (2009) showed that an autoencoder gradient provides an approximation to contrastive divergence training of RBMs.

For continuous-valued x, the denoising criterion with Gaussian corruption and reconstruction distribution yields an estimator of the score that is applicable to general encoder and decoder parametrizations (Alain and Bengio, 2013). This means a generic encoder-decoder architecture may be made to estimate the score
CHAPTER 14. AUTOENCODERS
by training with the squared error criterion

||g(f(x̃)) − x||²  (14.16)

and corruption

C(x̃ = x̃ | x) = N(x̃; µ = x, Σ = σ²I)  (14.17)

with noise variance σ². See Fig. 14.5 for an illustration of how this works.
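The corruption and cost above can be sketched in a few lines of numpy. This is a minimal illustration rather than the book's implementation; the tied-weight sigmoidal encoder, linear decoder, and layer sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny autoencoder: sigmoidal encoder f, linear decoder g (tied weights,
# an illustrative choice).
n_in, n_hid = 8, 4
W = rng.normal(scale=0.1, size=(n_hid, n_in))
b = np.zeros(n_hid)
c = np.zeros(n_in)

def f(x):          # encoder
    return sigmoid(W @ x + b)

def g(h):          # linear decoder
    return W.T @ h + c

def denoising_loss(x, sigma=0.1):
    # Corruption C(x_tilde | x) = N(x_tilde; mu = x, Sigma = sigma^2 I)  (Eq. 14.17)
    x_tilde = x + sigma * rng.normal(size=x.shape)
    # Squared error criterion ||g(f(x_tilde)) - x||^2                    (Eq. 14.16)
    return np.sum((g(f(x_tilde)) - x) ** 2)

x = rng.normal(size=n_in)
print(denoising_loss(x))
```

The key point is that the target of the squared error is the clean input x, while the encoder only ever sees the corrupted x̃.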
Figure 14.5: Vector field learned by a denoising autoencoder around a 1-D curved manifold near which the data concentrates in a 2-D space. Each arrow is proportional to the reconstruction minus input vector of the autoencoder and points towards higher probability according to the implicitly estimated probability distribution. The vector field has zeros at both maxima of the estimated density function (on the data manifolds) and at minima of that density function. For example, the spiral arm forms a one-dimensional manifold of local maxima that are connected to each other. Local minima appear near the middle of the gap between two arms. When the norm of reconstruction error (shown by the length of the arrows) is large, it means that probability can be significantly increased by moving in the direction of the arrow, and that is mostly the case in places of low probability. The autoencoder maps these low probability points to higher probability reconstructions. Where probability is maximal, the arrows shrink because the reconstruction becomes more accurate.

In general, there is no guarantee that the reconstruction g(f(x)) minus the input x corresponds to the gradient of any function, let alone to the score. That is
why the early results (Vincent, 2011) are specialized to particular parametrizations where g(f(x)) − x may be obtained by taking the derivative of another function. Kamyshanska and Memisevic (2015) generalized the results of Vincent (2011) by identifying a family of shallow autoencoders such that g(f(x)) − x corresponds to a score for all members of the family.

So far we have described only how the denoising autoencoder learns to represent a probability distribution. More generally, one may want to use the autoencoder as a generative model and draw samples from this distribution. This will be described later, in Sec. 20.11.

14.5.1.1 Historical Perspective

The idea of using MLPs for denoising dates back to the work of LeCun (1987) and Gallinari et al. (1987). Behnke (2001) also used recurrent networks to denoise images. Denoising autoencoders are, in some sense, just MLPs trained to denoise.
However, the name "denoising autoencoder" refers to a model that is intended not merely to learn to denoise its input but to learn a good internal representation as a side effect of learning to denoise. This idea came much later (Vincent et al., 2008, 2010). The learned representation may then be used to pretrain a deeper unsupervised network or a supervised network. Like sparse autoencoders, sparse coding, contractive autoencoders and other regularized autoencoders, the motivation for DAEs was to allow the learning of a very high-capacity encoder while preventing the encoder and decoder from learning a useless identity function.

Prior to the introduction of the modern DAE, Inayoshi and Kurita (2005) explored some of the same goals with some of the same methods.
Their approach minimizes reconstruction error in addition to a supervised objective while injecting noise in the hidden layer of a supervised MLP, with the objective to improve generalization by introducing the reconstruction error and the injected noise. However, their method was based on a linear encoder and could not learn function families as powerful as can the modern DAE.
14.6 Learning Manifolds with Autoencoders
Like many other machine learning algorithms, autoencoders exploit the idea that data concentrates around a low-dimensional manifold or a small set of such manifolds, as described in Sec. 5.11.3. Some machine learning algorithms exploit this idea only insofar as they learn a function that behaves correctly on the manifold but may have unusual behavior if given an input that is off the manifold.
Autoencoders take this idea further and aim to learn the structure of the manifold.

To understand how autoencoders do this, we must present some important characteristics of manifolds.

An important characterization of a manifold is the set of its tangent planes. At a point x on a d-dimensional manifold, the tangent plane is given by d basis vectors that span the local directions of variation allowed on the manifold. As illustrated in Fig. 14.6, these local directions specify how one can change x infinitesimally while staying on the manifold.

All autoencoder training procedures involve a compromise between two forces:

1. Learning a representation h of a training example x such that x can be approximately recovered from h through a decoder.
The fact that x is drawn from the training data is crucial, because it means the autoencoder need not successfully reconstruct inputs that are not probable under the data generating distribution.

2. Satisfying the constraint or regularization penalty. This can be an architectural constraint that limits the capacity of the autoencoder, or it can be a regularization term added to the reconstruction cost. These techniques generally prefer solutions that are less sensitive to the input.

Clearly, neither force alone would be useful: copying the input to the output is not useful on its own, nor is ignoring the input. Instead, the two forces together are useful because they force the hidden representation to capture information about the structure of the data generating distribution. The important principle is that the autoencoder can afford to represent only the variations that are needed to reconstruct training examples.
If the data generating distribution concentrates near a low-dimensional manifold, this yields representations that implicitly capture a local coordinate system for this manifold: only the variations tangent to the manifold around x need to correspond to changes in h = f(x). Hence the encoder learns a mapping from the input space x to a representation space, a mapping that is only sensitive to changes along the manifold directions, but that is insensitive to changes orthogonal to the manifold.

A one-dimensional example is illustrated in Fig. 14.7, showing that by making the reconstruction function insensitive to perturbations of the input around the data points we recover the manifold structure.

To understand why autoencoders are useful for manifold learning, it is instructive to compare them to other approaches.
What is most commonly learned to characterize a manifold is a representation of the data points on (or near) the
Figure 14.6: An illustration of the concept of a tangent hyperplane. Here we create a one-dimensional manifold in 784-dimensional space. We take an MNIST image with 784 pixels and transform it by translating it vertically. The amount of vertical translation defines a coordinate along a one-dimensional manifold that traces out a curved path through image space. This plot shows a few points along this manifold. For visualization, we have projected the manifold into two dimensional space using PCA. An n-dimensional manifold has an n-dimensional tangent plane at every point. This tangent plane touches the manifold exactly at that point and is oriented parallel to the surface at that point. It defines the space of directions in which it is possible to move while remaining on the manifold. This one-dimensional manifold has a single tangent line. We indicate an example tangent line at one point, with an image showing how this tangent direction appears in image space.
Gray pixels indicate pixels that do not change as we move along the tangent line, white pixels indicate pixels that brighten, and black pixels indicate pixels that darken.
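The construction in Figure 14.6 can be reproduced in a few lines of numpy. This sketch substitutes a synthetic 28×28 bar image for the MNIST digit, and the shift range and finite-difference tangent estimate are illustrative assumptions:

```python
import numpy as np

# Build a 1-D manifold by vertically translating a synthetic 28x28 "image"
# (a bright horizontal bar; a stand-in for the MNIST digit in Fig. 14.6).
img = np.zeros((28, 28))
img[10:14, 4:24] = 1.0

# Each vertical shift gives one point on the manifold in 784-dim space.
shifts = range(-8, 9)
points = np.stack([np.roll(img, s, axis=0).ravel() for s in shifts])

# Project the manifold into 2-D with PCA: center, then keep the
# top-2 right singular vectors of the centered data matrix.
centered = points - points.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords2d = centered @ Vt[:2].T          # shape (17, 2): the curved path

# A finite-difference between neighboring shifts approximates the single
# tangent direction of this one-dimensional manifold at the middle point.
mid = len(points) // 2
tangent = points[mid + 1] - points[mid - 1]
```

Reshaping `tangent` back to 28×28 gives exactly the gray/white/black tangent image the caption describes: zero where pixels are unchanged, positive where they brighten, negative where they darken.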
[Plot for Figure 14.7: r(x) versus x, showing the identity function (dashed) and the optimal reconstruction function; data points x0, x1, x2 marked on the x-axis.]
Figure 14.7: If the autoencoder learns a reconstruction function that is invariant to small perturbations near the data points, it captures the manifold structure of the data. Here the manifold structure is a collection of 0-dimensional manifolds. The dashed diagonal line indicates the identity function target for reconstruction. The optimal reconstruction function crosses the identity function wherever there is a data point. The horizontal arrows at the bottom of the plot indicate the r(x) − x reconstruction direction vector at the base of the arrow, in input space, always pointing towards the nearest "manifold" (a single data point, in the 1-D case). The denoising autoencoder explicitly tries to make the derivative of the reconstruction function r(x) small around the data points. The contractive autoencoder does the same for the encoder. Although the derivative of r(x) is asked to be small around the data points, it can be large between the data points. The space between the data points corresponds to the region between the manifolds, where the reconstruction function must have a large derivative in order to map corrupted points back onto the manifold.
manifold. Such a representation for a particular example is also called its embedding. It is typically given by a low-dimensional vector, with fewer dimensions than the "ambient" space of which the manifold is a low-dimensional subset. Some algorithms (non-parametric manifold learning algorithms, discussed below) directly learn an embedding for each training example, while others learn a more general mapping, sometimes called an encoder, or representation function, that maps any point in the ambient space (the input space) to its embedding.

Manifold learning has mostly focused on unsupervised learning procedures that attempt to capture these manifolds. Most of the initial machine learning research on learning nonlinear manifolds has focused on non-parametric methods based on the nearest-neighbor graph. This graph has one node per training example and edges connecting near neighbors to each other. These methods (Schölkopf et al.,
1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and
Figure 14.8: Non-parametric manifold learning procedures build a nearest neighbor graph whose nodes are training examples and arcs connect nearest neighbors. Various procedures can thus obtain the tangent plane associated with a neighborhood of the graph as well as a coordinate system that associates each training example with a real-valued vector position, or embedding. It is possible to generalize such a representation to new examples by a form of interpolation. So long as the number of examples is large enough to cover the curvature and twists of the manifold, these approaches work well. Images from the QMUL Multiview Face Dataset (Gong et al., 2000).
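The per-neighborhood tangent plane that these procedures obtain can be estimated with a short numpy sketch. The noisy circle standing in for the face images, the neighborhood size k, and the SVD-based basis are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points near a 1-D manifold (a circle) embedded in 3-D.
t = rng.uniform(0.0, 2.0 * np.pi, size=200)
X = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
X += 0.01 * rng.normal(size=X.shape)     # small off-manifold noise

def local_tangent_plane(X, i, k=10, d=1):
    """Estimate a d-dim tangent basis at point i from its k nearest neighbors."""
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]    # exclude the point itself
    diffs = X[neighbors] - X[i]               # difference vectors to neighbors
    # The leading right singular vectors of the difference vectors span
    # the local directions of variation.
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:d]

basis = local_tangent_plane(X, i=0)
```

For a point on this circle, the estimated one-dimensional basis should lie close to the circle's tangent direction and nearly in the xy-plane.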
Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003; van der Maaten and Hinton, 2008) associate each node with a tangent plane that spans the directions of variation associated with the difference vectors between the example and its neighbors, as illustrated in Fig. 14.8.

A global coordinate system can then be obtained through an optimization or by solving a linear system. Fig. 14.9 illustrates how a manifold can be tiled by a large number of locally linear Gaussian-like patches (or "pancakes," because the Gaussians are flat in the tangent directions).

However, there is a fundamental difficulty with such local non-parametric
approaches to manifold learning, raised in Bengio and Monperrus (2005): if the manifolds are not very smooth (they have many peaks and troughs and twists), one may need a very large number of training examples to cover each one of these variations, with no chance to generalize to unseen variations. Indeed, these methods
Figure 14.9: If the tangent planes (see Fig. 14.6) at each location are known, then they can be tiled to form a global coordinate system or a density function. Each local patch can be thought of as a local Euclidean coordinate system or as a locally flat Gaussian, or "pancake," with a very small variance in the directions orthogonal to the pancake and a very large variance in the directions defining the coordinate system on the pancake. A mixture of these Gaussians provides an estimated density function, as in the manifold Parzen window algorithm (Vincent and Bengio, 2003) or its non-local neural-net based variant (Bengio et al., 2006c).
can only generalize the shape of the manifold by interpolating between neighboring examples. Unfortunately, the manifolds involved in AI problems can have very complicated structure that can be difficult to capture from only local interpolation. Consider for example the manifold resulting from translation shown in Fig. 14.6. If we watch just one coordinate within the input vector, x_i, as the image is translated, we will observe that one coordinate encounters a peak or a trough in its value once for every peak or trough in brightness in the image. In other words, the complexity of the patterns of brightness in an underlying image template drives the complexity of the manifolds that are generated by performing simple image transformations. This motivates the use of distributed representations and deep learning for capturing manifold structure.
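The peak-counting observation above can be checked numerically. In this small sketch, the two-bump 1-D "image" and the tracked coordinate are invented for illustration; circular translation stands in for the vertical translation of Fig. 14.6:

```python
import numpy as np

# A 1-D "image" whose brightness pattern has two bumps (at positions 6 and 20).
n = 32
idx = np.arange(n)
row = np.exp(-((idx - 6) ** 2) / 4.0) + np.exp(-((idx - 20) ** 2) / 4.0)

# Track a single coordinate x_i while the image is translated (circularly).
i = 0
trace = np.array([np.roll(row, s)[i] for s in range(n)])

# Count strict local maxima of the tracked coordinate: one per bump of
# brightness in the image, matching the argument in the text.
peaks = np.sum((trace[1:-1] > trace[:-2]) & (trace[1:-1] > trace[2:]))
```

A two-bump image produces a coordinate trace with two peaks; an image with many brightness peaks would produce an equally wiggly, hard-to-interpolate manifold coordinate.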
14.7 Contractive Autoencoders
The contractive autoencoder (Rifai et al., 2011a,b) introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible:

Ω(h) = λ || ∂f(x)/∂x ||²_F.  (14.18)

The penalty Ω(h) is the squared Frobenius norm (sum of squared elements) of the Jacobian matrix of partial derivatives associated with the encoder function.

There is a connection between the denoising autoencoder and the contractive autoencoder: Alain and Bengio (2013) showed that in the limit of small Gaussian input noise, the denoising reconstruction error is equivalent to a contractive penalty on the reconstruction function that maps x to r = g(f(x)). In other words, denoising autoencoders make the reconstruction function resist small but finite-sized perturbations of the input, while contractive autoencoders make the feature extraction function resist infinitesimal perturbations of the input.
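For a sigmoidal encoder the Jacobian in Eq. 14.18 has a simple closed form, so the penalty can be computed directly. This is a minimal sketch; the layer sizes and the value of λ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sigmoidal encoder h = f(x) = sigmoid(W x + b).
n_in, n_hid = 6, 3
W = rng.normal(scale=0.5, size=(n_hid, n_in))
b = np.zeros(n_hid)

def contractive_penalty(x, lam=0.1):
    """Eq. 14.18: lambda * ||df/dx||_F^2 for a sigmoid encoder.

    Since dh_i/dx_j = h_i (1 - h_i) W_ij, the Jacobian is
    diag(h * (1 - h)) @ W, and its squared Frobenius norm is
    sum_ij (h_i (1 - h_i))^2 W_ij^2.
    """
    h = sigmoid(W @ x + b)
    jac = (h * (1.0 - h))[:, None] * W      # shape (n_hid, n_in)
    return lam * np.sum(jac ** 2)

x = rng.normal(size=n_in)
print(contractive_penalty(x))
```

The `h * (1 - h)` factor makes visible why saturating the sigmoid units toward 0 or 1 shrinks the Jacobian, a point the text returns to below.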
When using the Jacobian-based contractive penalty to pretrain features f(x) for use with a classifier, the best classification accuracy usually results from applying the contractive penalty to f(x) rather than to g(f(x)). A contractive penalty on f(x) also has close connections to score matching, as discussed in Sec. 14.5.1.

The name contractive arises from the way that the CAE warps space. Specifically, because the CAE is trained to resist perturbations of its input, it is encouraged to map a neighborhood of input points to a smaller neighborhood of output points. We can think of this as contracting the input neighborhood to a smaller output neighborhood.
Globally, two different points x and x′ may be mapped to points f(x) and f(x′) that are farther apart than the original points. It is plausible that f could be expanding in-between or far from the data manifolds (see for example what happens in the 1-D toy example of Fig. 14.7). When the Ω(h) penalty is applied to sigmoidal units, one easy way to shrink the Jacobian is to make the sigmoid units saturate to 0 or 1. This encourages the CAE to encode input points with extreme values of the sigmoid that may be interpreted as a binary code. It also ensures that the CAE will spread its code values throughout most of the hypercube that its sigmoidal hidden units can span.

We can think of the Jacobian matrix J at a point x as approximating the nonlinear encoder f(x) as being a linear operator. This allows us to use the word "contractive" more formally.
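The saturation argument can be checked directly: a sigmoid unit's contribution to the Jacobian is scaled by h(1 − h), which vanishes as the unit saturates toward 0 or 1. A quick sketch (the pre-activation values are chosen arbitrarily):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The Jacobian of a sigmoid layer is scaled element-wise by h * (1 - h),
# so saturated units contribute almost nothing to the Frobenius norm.
for z in [0.0, 2.0, 5.0, 10.0]:
    h = sigmoid(z)
    print(f"z = {z:4.1f}   h = {h:.6f}   h*(1-h) = {h * (1.0 - h):.2e}")
```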
In the theory ofatlinear operators, a linear op operator erator nonlinear encoder f (x) as being a linear operator. This allows us to use the word 524 of linear operators, a linear op erator “contractive” more formally. In the theory
CHAPTER 14. AUTOENCODERS
is said to be contractive if the norm of Jx remains less than or equal to 1 for all unit-norm x. In other words, J is contractive if it shrinks the unit sphere. We can think of the CAE as penalizing the Frobenius norm of the local linear approximation of f(x) at every training point x in order to encourage each of these local linear operators to become a contraction.

As described in Sec. 14.6, regularized autoencoders learn manifolds by balancing two opposing forces. In the case of the CAE, these two forces are reconstruction error and the contractive penalty Ω(h). Reconstruction error alone would encourage the CAE to learn an identity function. The contractive penalty alone would encourage the CAE to learn features that are constant with respect to x. The compromise between these two forces yields an autoencoder whose derivatives ∂f(x)/∂x are mostly tiny.
Only a small num numb ber of hidden units, corresponding to a ∂x are mostly tin compromise betw een these t wo forces yields an eautoenco der whose derivatives small number of directions in the input, ma may y hav have significan significant t deriv derivatives. atives. are mostly tiny. Only a small number of hidden units, corresponding to a The goal of the CAE is to learn the manifold structure of the data. Directions small number of directions in the input, may have significant derivatives. x with large J x rapidly change h, so these are likely to be directions whic which h The goal of the CAE is to learn the manifold structure of the data. Directions appro approximate ximate the tangent planes of the manifold. Exp Experiments eriments by Rifai et al. (2011a) x J x h with large rapidly c hange , so these are likely e directions h and Rifai et al. (2011b) sho show w that training the CAE resultstoin bmost singular whic values appro ximate the tangent of theand manifold. Exp by Rifai ete.al.How (2011a J dropping of below 1 inplanes magnitude therefore beriments ecoming con contractiv tractiv tractive. Howev ev ever, er,) and Rifai et al. v(alues 2011b)remain show that training the CAE in most singular values some singular ab abo ove 1, because the results reconstruction error penalty J of dropping magnitude and therefore tractiv e. Howev er, encourages thebelow CAE 1toinencode the directions withbecoming the mostcon local variance. The some singular v alues remain ab o ve 1 , because the reconstruction error p enalty directions corresp corresponding onding to the largest singular values are interpreted as the tangen tangentt encouragesthat the CAE totractive encode the directions with the most local, vthese ariance. Thet directions the con contractive autoenco autoencoder der has learned. 
Ideally Ideally, tangen tangent directions corresp to thetolargest interpreted as the tangen directions shouldonding corresp correspond ond real vsingular ariationsvalues in theare data. For example, a CAEt directions that the contractive autoenco der has , these tangen applied to images should learn tangent vectors thatlearned. show ho how wIdeally the image changes ast directions should corresp ond to real v ariations in the data. F or example, a CAE ob objects jects in the image gradually change pose, as shown in Fig. 14.6. Visualizations of applied to images learnsingular tangentvectors vectorsdo that show how theond image changes as the exp experimen erimen erimentally tallyshould obtained seem to corresp correspond to meaningful ob jects in the image gradually changeaspose, as shown Fig. .14.6. Visualizations of transformations of the input image, shown in Fig.in14.10 the experimentally obtained singular vectors do seem to correspond to meaningful One practicalofissue with the CAE transformations the input image, as regularization shown in Fig. criterion 14.10. is that although it is cheap to compute in the case of a single hidden lay layer er autoenco autoencoder, der, it becomes One practical issue the of CAE regularization criterion is that although it much more exp expensiv ensiv ensive e in with the case deep deeper er auto autoenco enco encoders. ders. The strategy follo followed wed by is cheap to (compute case of train a single hidden layer autoenco der, itders, becomes Rifai et al. 2011a) is in to the separately a series of single-la single-lay yer auto autoenco enco encoders, each m uch more exp ensiv e in the case of deep er auto enco ders. The strategy follo wed by trained to reconstruct the previous auto autoenco enco encoder’s der’s hidden lay layer. er. The comp composition osition Rifai et al. 
(2011a) is to separately train a series of single-layer autoencoders, each trained to reconstruct the previous autoencoder's hidden layer. The composition of these autoencoders then forms a deep autoencoder. Because each layer was separately trained to be locally contractive, the deep autoencoder is contractive as well. The result is not the same as what would be obtained by jointly training the entire architecture with a penalty on the Jacobian of the deep model, but it captures many of the desirable qualitative characteristics.

Another practical issue is that the contraction penalty can obtain useless results
[Figure omitted: panels showing an input point, tangent vectors from local PCA (no sharing across regions), and tangent vectors from a contractive autoencoder.]

Figure 14.10: Illustration of tangent vectors of the manifold estimated by local PCA and by a contractive autoencoder. The location on the manifold is defined by the input image of a dog drawn from the CIFAR-10 dataset. The tangent vectors are estimated by the leading singular vectors of the Jacobian matrix ∂h/∂x of the input-to-code mapping. Although both local PCA and the CAE can capture local tangents, the CAE is able to form more accurate estimates from limited training data because it exploits parameter sharing across different locations that share a subset of active hidden units. The CAE tangent directions typically correspond to moving or changing parts of the object (such as the head or legs).
if we do not impose some sort of scale on the decoder. For example, the encoder could consist of multiplying the input by a small constant ε and the decoder could consist of dividing the code by ε. As ε approaches 0, the encoder drives the contractive penalty Ω(h) to approach 0 without having learned anything about the distribution. Meanwhile, the decoder maintains perfect reconstruction. In Rifai et al. (2011a), this is prevented by tying the weights of f and g. Both f and g are standard neural network layers consisting of an affine transformation followed by an element-wise nonlinearity, so it is straightforward to set the weight matrix of g to be the transpose of the weight matrix of f.
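The weight-tying trick can be sketched as follows; the class name and layer sizes are hypothetical, and training is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TiedAutoencoder:
    """Encoder f and decoder g share one weight matrix; g uses its transpose."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_hidden, n_in)) * 0.1
        self.b_enc = np.zeros(n_hidden)
        self.b_dec = np.zeros(n_in)

    def encode(self, x):
        # f(x): affine transformation followed by an element-wise nonlinearity.
        return sigmoid(self.W @ x + self.b_enc)

    def decode(self, h):
        # g(h) reuses W transposed, so the encoder cannot shrink its Jacobian
        # by scaling down while the decoder silently scales back up.
        return sigmoid(self.W.T @ h + self.b_dec)

ae = TiedAutoencoder(n_in=6, n_hidden=4)
x = np.ones(6)
r = ae.decode(ae.encode(x))
print(r.shape)  # prints (6,)
```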
14.8 Predictive Sparse Decomposition
Predictive sparse decomposition (PSD) is a model that is a hybrid of sparse coding and parametric autoencoders (Kavukcuoglu et al., 2008). A parametric encoder is trained to predict the output of iterative inference. PSD has been applied to unsupervised feature learning for object recognition in images and video (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009; Farabet et al., 2011), as well as for audio (Henaff et al., 2011). The model consists of an encoder f(x) and a decoder g(h) that are both parametric. During training, h is controlled by the
optimization algorithm. Training proceeds by minimizing

‖x − g(h)‖² + λ|h|₁ + γ‖h − f(x)‖².   (14.19)

Like in sparse coding, the training algorithm alternates between minimization with respect to h and minimization with respect to the model parameters. Minimization with respect to h is fast because f(x) provides a good initial value of h and the cost function constrains h to remain near f(x) anyway. Simple gradient descent can obtain reasonable values of h in as few as ten steps.

The training procedure used by PSD is different from first training a sparse coding model and then training f(x) to predict the values of the sparse coding features. The PSD training procedure regularizes the decoder to use parameters for which f(x) can infer good code values.

Predictive sparse coding is an example of learned approximate inference. In Sec. 19.5, this topic is developed further. The tools presented in Chapter 19 make it clear that PSD can be interpreted as training a directed sparse coding probabilistic
model by maximizing a lower bound on the log-likelihood of the model.

In practical applications of PSD, the iterative optimization is only used during training. The parametric encoder f is used to compute the learned features when the model is deployed. Evaluating f is computationally inexpensive compared to inferring h via gradient descent. Because f is a differentiable parametric function, PSD models may be stacked and used to initialize a deep network to be trained with another criterion.
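A minimal sketch of Eq. 14.19 and the gradient-based inference step over h, assuming a linear decoder and a precomputed encoder output (both are stand-ins; PSD only requires that the encoder and decoder be parametric):

```python
import numpy as np

def psd_cost(x, h, f_x, W_dec, lam=0.1, gamma=1.0):
    # Eq. 14.19: ||x - g(h)||^2 + lam*|h|_1 + gamma*||h - f(x)||^2,
    # here with a linear decoder g(h) = W_dec @ h (an assumption for
    # the sketch; any parametric decoder would do).
    r = W_dec @ h
    return (np.sum((x - r) ** 2)
            + lam * np.sum(np.abs(h))
            + gamma * np.sum((h - f_x) ** 2))

def infer_h(x, f_x, W_dec, lam=0.1, gamma=1.0, lr=0.05, steps=10):
    # Minimize Eq. 14.19 over h. A few gradient steps suffice because
    # h starts at, and is pulled toward, the encoder's prediction f(x).
    h = f_x.copy()
    for _ in range(steps):
        grad = (-2.0 * W_dec.T @ (x - W_dec @ h)   # reconstruction term
                + lam * np.sign(h)                 # subgradient of the L1 term
                + 2.0 * gamma * (h - f_x))         # stay near f(x)
        h -= lr * grad
    return h

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(6, 4)) * 0.3
x = rng.normal(size=6)
f_x = 0.1 * rng.normal(size=4)   # stand-in for the encoder's output f(x)
h = infer_h(x, f_x, W_dec)
print("cost at f(x):   ", float(psd_cost(x, f_x, f_x, W_dec)))
print("cost after steps:", float(psd_cost(x, h, f_x, W_dec)))
```

In full PSD training, this inner minimization over h would alternate with gradient updates to the encoder and decoder parameters.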
14.9 Applications of Autoencoders
Autoencoders have been successfully applied to dimensionality reduction and information retrieval tasks. Dimensionality reduction was one of the first applications of representation learning and deep learning. It was one of the early motivations for studying autoencoders. For example, Hinton and Salakhutdinov (2006) trained a stack of RBMs and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers, culminating in a bottleneck of 30 units. The resulting code yielded less reconstruction error than PCA into 30 dimensions, and the learned representation was qualitatively easier to interpret and relate to the underlying categories, with these categories manifesting as well-separated clusters.

Lower-dimensional representations can improve performance on many tasks, such as classification. Models of smaller spaces consume less memory and runtime.
Many forms of dimensionality reduction place semantically related examples near
each other, as observed by Salakhutdinov and Hinton (2007b) and Torralba et al. (2008). The hints provided by the mapping to the lower-dimensional space aid generalization.

One task that benefits even more than usual from dimensionality reduction is information retrieval, the task of finding entries in a database that resemble a query entry. This task derives the usual benefits from dimensionality reduction that other tasks do, but also derives the additional benefit that search can become extremely efficient in certain kinds of low-dimensional spaces. Specifically, if we train the dimensionality reduction algorithm to produce a code that is low-dimensional and binary, then we can store all database entries in a hash table mapping binary code vectors to entries. This hash table allows us to perform information retrieval by returning all database entries that have the same binary code as the query.
We can also search over slightly less similar entries very efficiently, just by flipping individual bits from the encoding of the query. This approach to information retrieval via dimensionality reduction and binarization is called semantic hashing (Salakhutdinov and Hinton, 2007b, 2009b), and has been applied to both textual input (Salakhutdinov and Hinton, 2007b, 2009b) and images (Torralba et al., 2008; Weiss et al., 2008; Krizhevsky and Hinton, 2011).

To produce binary codes for semantic hashing, one typically uses an encoding function with sigmoids on the final layer. The sigmoid units must be trained to be saturated to nearly 0 or nearly 1 for all input values. One trick that can accomplish this is simply to inject additive noise just before the sigmoid nonlinearity during training. The magnitude of the noise should increase over time.
To fight that noise and preserve as much information as possible, the network must increase the magnitude of the inputs to the sigmoid function, until saturation occurs.

The idea of learning a hashing function has been further explored in several directions, including the idea of training the representations so as to optimize a loss more directly linked to the task of finding nearby examples in the hash table (Norouzi and Fleet, 2011).
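The retrieval scheme described above, a hash table keyed on binary codes with near-miss search by flipping bits, can be sketched as follows (the database entries and codes are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical database entries with precomputed 4-bit binary codes
# (in practice the codes would come from a trained sigmoid encoder).
codes = {
    "doc_a": (1, 0, 1, 1),
    "doc_b": (1, 0, 1, 0),
    "doc_c": (0, 1, 0, 0),
}

# Hash table mapping each binary code to the entries that share it.
table = defaultdict(list)
for name, code in codes.items():
    table[code].append(name)

def search(query, radius=1):
    """Return entries whose code is within `radius` bit flips of `query`."""
    hits = list(table.get(query, []))          # exact matches first
    for r in range(1, radius + 1):
        for bits in combinations(range(len(query)), r):
            flipped = tuple(b ^ (i in bits) for i, b in enumerate(query))
            hits.extend(table.get(flipped, []))
    return hits

print(search((1, 0, 1, 1), radius=1))  # → ['doc_a', 'doc_b']
```

Each lookup is a constant-time hash probe, so a radius-r search costs only the number of codes within r bit flips, independent of the database size.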
Chapter 15
Representation Learning

In this chapter, we first discuss what it means to learn representations and how the notion of representation can be useful to design deep architectures. We discuss how learning algorithms share statistical strength across different tasks, including using information from unsupervised tasks to perform supervised tasks. Shared representations are useful to handle multiple modalities or domains, or to transfer learned knowledge to tasks for which few or no examples are given but a task representation exists. Finally, we step back and argue about the reasons for the success of representation learning, starting with the theoretical advantages of distributed representations (Hinton et al., 1986) and deep representations, and ending with the more general idea of underlying assumptions about the data generating process, in particular about underlying causes of the observed data.
Many information processing tasks can be very easy or very difficult depending on how the information is represented. This is a general principle applicable to daily life, to computer science in general, and to machine learning. For example, it is straightforward for a person to divide 210 by 6 using long division. The task becomes considerably less straightforward if it is instead posed using the Roman numeral representation of the numbers. Most modern people asked to divide CCX by VI would begin by converting the numbers to the Arabic numeral representation, permitting long division procedures that make use of the place value system. More concretely, we can quantify the asymptotic runtime of various operations using appropriate or inappropriate representations. For example, inserting a number
For example, inserting a num umb b er concretely , we can quantify the asymptotic run time of v arious op erations using in into to the correct p osition in a sorted list of num numbers bers is an O(n) op operation eration if the appropriate or inappropriate represen tations. For a numasb er O(log n) example, list is represented as a linked list, but only if the list inserting is represented a O ( n in to the correct p osition in a sorted list of num bers is an ) op eration if the red-blac red-black k tree. list is represented as a linked list, but only O(log n) if the list is represented as a In the context of mac machine hine learning, what mak makes es one representation b etter than red-black tree. In the context of machine learning, 529 what makes one representation b etter than 529
CHAPTER 15. REPRESENTATION LEARNING
another? Generally speaking, a good representation is one that makes a subsequent learning task easier. The choice of representation will usually depend on the choice of the subsequent learning task.

We can think of feedforward networks trained by supervised learning as performing a kind of representation learning. Specifically, the last layer of the network is typically a linear classifier, such as a softmax regression classifier. The rest of the network learns to provide a representation to this classifier. Training with a supervised criterion naturally leads to the representation at every hidden layer (but more so near the top hidden layer) taking on properties that make the classification task easier.
For example, classes that were not linearly separable in the input features may become linearly separable in the last hidden layer. In principle, the last layer could be another kind of model, such as a nearest neighbor classifier (Salakhutdinov and Hinton, 2007a). The features in the penultimate layer should learn different properties depending on the type of the last layer.

Supervised training of feedforward networks does not involve explicitly imposing any condition on the learned intermediate features. Other kinds of representation learning algorithms are often explicitly designed to shape the representation in some particular way. For example, suppose we want to learn a representation that makes density estimation easier. Distributions with more independences are easier to model, so we could design an objective function that encourages the elements of the representation vector h to be independent.
Just like supervised networks, unsupervised deep learning algorithms have a main training objective but also learn a representation as a side effect. Regardless of how a representation was obtained, it can be used for another task. Alternatively, multiple tasks (some supervised, some unsupervised) can be learned together with some shared internal representation.

Most representation learning problems face a tradeoff between preserving as much information about the input as possible and attaining nice properties (such as independence).

Representation learning is particularly interesting because it provides one way to perform unsupervised and semi-supervised learning. We often have very large amounts of unlabeled training data and relatively little labeled training data. Training with supervised learning techniques on the labeled subset often results in severe overfitting. Semi-supervised learning offers the chance to resolve this overfitting problem by also learning from the unlabeled data. Specifically, we can learn good representations for the unlabeled data, and then use these representations to solve the supervised learning task.

Humans and animals are able to learn from very few labeled examples. We do
CHAPTER 15. REPRESENTATION LEARNING
not yet know how this is possible. Many factors could explain improved human performance; for example, the brain may use very large ensembles of classifiers or Bayesian inference techniques. One popular hypothesis is that the brain is able to leverage unsupervised or semi-supervised learning. There are many ways to leverage unlabeled data. In this chapter, we focus on the hypothesis that the unlabeled data can be used to learn a good representation.
15.1 Greedy Layer-Wise Unsupervised Pretraining
Unsupervised learning played a key historical role in the revival of deep neural networks, allowing for the first time to train a deep supervised network without requiring architectural specializations like convolution or recurrence. We call this procedure unsupervised pretraining, or more precisely, greedy layer-wise unsupervised pretraining. This procedure is a canonical example of how a representation learned for one task (unsupervised learning, trying to capture the shape of the input distribution) can sometimes be useful for another task (supervised learning with the same input domain).

Greedy layer-wise unsupervised pretraining relies on a single-layer representation learning algorithm such as an RBM, a single-layer autoencoder, a sparse coding model, or another model that learns latent representations. Each layer is pretrained using unsupervised learning, taking the output of the previous layer and producing as output a new representation of the data, whose distribution (or its relation to other variables such as categories to predict) is hopefully simpler. See Algorithm 15.1 for a formal description.

Greedy layer-wise training procedures based on unsupervised criteria have long been used to sidestep the difficulty of jointly training the layers of a deep neural net for a supervised task. This approach dates back at least as far as the Neocognitron (Fukushima, 1975). The deep learning renaissance of 2006 began with the discovery that this greedy learning procedure could be used to find a good initialization for a joint learning procedure over all the layers, and that this approach could be used to successfully train even fully connected architectures (Hinton et al., 2006; Hinton and Salakhutdinov, 2006; Hinton, 2006; Bengio et al., 2007; Ranzato et al., 2007a). Prior to this discovery, only convolutional deep networks or networks whose depth resulted from recurrence were regarded as feasible to train. Today, we know that greedy layer-wise pretraining is not required to train fully connected deep architectures, but the unsupervised pretraining approach was the first method to succeed.

Greedy layer-wise pretraining is called greedy because it is a greedy algorithm,
meaning that it optimizes each piece of the solution independently, one piece at a time, rather than jointly optimizing all pieces. It is called layer-wise because these independent pieces are the layers of the network. Specifically, greedy layer-wise pretraining proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed. In particular, the lower layers (which are trained first) are not adapted after the upper layers are introduced. It is called unsupervised because each layer is trained with an unsupervised representation learning algorithm. However, it is also called pretraining, because it is supposed to be only a first step before a joint training algorithm is applied to fine-tune all the layers together. In the context of a supervised learning task, it can be viewed as a regularizer (in some experiments, pretraining decreases test error without decreasing training error) and a form of parameter initialization.

It is common to use the word "pretraining" to refer not only to the pretraining stage itself but to the entire two-phase protocol that combines the pretraining phase and a supervised learning phase. The supervised learning phase may involve training a simple classifier on top of the features learned in the pretraining phase, or it may involve supervised fine-tuning of the entire network learned in the pretraining phase. No matter what kind of unsupervised learning algorithm or what model type is employed, in the vast majority of cases the overall training scheme is nearly the same. While the choice of unsupervised learning algorithm will obviously impact the details, most applications of unsupervised pretraining follow this basic protocol.

Greedy layer-wise unsupervised pretraining can also be used as initialization for other unsupervised learning algorithms, such as deep autoencoders (Hinton and Salakhutdinov, 2006) and probabilistic models with many layers of latent variables. Such models include deep belief networks (Hinton et al., 2006) and deep Boltzmann machines (Salakhutdinov and Hinton, 2009a). These deep generative models will be described in Chapter 20.

As discussed in Sec. 8.7.4, it is also possible to have greedy layer-wise supervised pretraining. This builds on the premise that training a shallow network is easier than training a deep one, which seems to have been validated in several contexts (Erhan et al., 2010).
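As a concrete illustration, the greedy layer-wise protocol of Algorithm 15.1 can be sketched in a few lines of numpy. This is a minimal sketch, not an implementation from the book: the single-layer unsupervised learner L is a hypothetical stand-in (a PCA projection followed by a tanh), where in practice one would use an RBM or autoencoder, and the fine-tuning step T is omitted.

```python
import numpy as np

def pca_layer(X, k):
    """Hypothetical stand-in for the unsupervised learner L: fits a
    k-component PCA on X and returns an encoder function f."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:k].T
    # tanh nonlinearity so stacked layers are not just one linear map
    return lambda Z: np.tanh((Z - mu) @ W)

def greedy_pretrain(X, layer_sizes):
    """Algorithm 15.1 without the fine-tuning step: train one layer at a
    time on the previous layer's output, then compose the encoders."""
    encoders = []
    X_tilde = X
    for k in layer_sizes:
        f_k = pca_layer(X_tilde, k)   # f^(k) = L(X~)
        encoders.append(f_k)
        X_tilde = f_k(X_tilde)        # X~ <- f^(k)(X~)
    def f(Z):                         # f = f^(m) o ... o f^(1)
        for enc in encoders:
            Z = enc(Z)
        return Z
    return f

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
f = greedy_pretrain(X, [10, 5])
print(f(X).shape)  # (100, 5)
```

The composed encoder f could then be frozen and a classifier trained on f(X), or f could be used to initialize a network for supervised fine-tuning.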
15.1.1 When and Why Does Unsupervised Pretraining Work?
On many tasks, greedy layer-wise unsupervised pretraining can yield substantial improvements in test error for classification tasks. This observation was responsible for the renewed interest in deep neural networks starting in 2006 (Hinton et al.,
2006; Bengio et al., 2007; Ranzato et al., 2007a). On many other tasks, however, unsupervised pretraining either does not confer a benefit or even causes noticeable harm. Ma et al. (2015) studied the effect of pretraining on machine learning models for chemical activity prediction and found that, on average, pretraining was slightly harmful, but for many tasks was significantly helpful. Because unsupervised pretraining is sometimes helpful but often harmful, it is important to understand when and why it works in order to determine whether it is applicable to a particular task.

Algorithm 15.1 Greedy layer-wise unsupervised pretraining protocol.
Given the following: an unsupervised feature learning algorithm L, which takes a training set of examples and returns an encoder or feature function f. The raw input data is X, with one row per example, and f^(1)(X) is the output of the first stage encoder on X and the dataset used by the second level unsupervised feature learner. In the case where fine-tuning is performed, we use a learner T which takes an initial function f, input examples X (and, in the supervised fine-tuning case, associated targets Y), and returns a tuned function. The number of stages is m.

f ← Identity function
X̃ = X
for k = 1, ..., m do
    f^(k) = L(X̃)
    f ← f^(k) ∘ f
    X̃ ← f^(k)(X̃)
end for
if fine-tuning then
    f ← T(f, X, Y)
end if
Return f

At the outset, it is important to clarify that most of this discussion is restricted to greedy unsupervised pretraining in particular. There are other, completely different paradigms for performing semi-supervised learning with neural networks, such as virtual adversarial training described in Sec. 7.13. It is also possible to train an autoencoder or generative model at the same time as the supervised model.
Examples of this single-stage approach include the discriminative RBM (Larochelle and Bengio, 2008) and the ladder network (Rasmus et al., 2015), in which the total objective is an explicit sum of the two terms (one using the labels and one only using the input).

Unsupervised pretraining combines two different ideas. First, it makes use of
the idea that the choice of initial parameters for a deep neural network can have a significant regularizing effect on the model (and, to a lesser extent, that it can improve optimization). Second, it makes use of the more general idea that learning about the input distribution can help to learn about the mapping from inputs to outputs.

Both of these ideas involve many complicated interactions between several parts of the machine learning algorithm that are not entirely understood.

The first idea, that the choice of initial parameters for a deep neural network can have a strong regularizing effect on its performance, is the least well understood. At the time that pretraining became popular, it was understood as initializing the model in a location that would cause it to approach one local minimum rather than another. Today, local minima are no longer considered to be a serious problem for neural network optimization. We now know that our standard neural network training procedures usually do not arrive at a critical point of any kind. It remains possible that pretraining initializes the model in a location that would otherwise be inaccessible: for example, a region that is surrounded by areas where the cost function varies so much from one example to another that minibatches give only a very noisy estimate of the gradient, or a region surrounded by areas where the Hessian matrix is so poorly conditioned that gradient descent methods must use very small steps. However, our ability to characterize exactly what aspects of the pretrained parameters are retained during the supervised training stage is limited. This is one reason that modern approaches typically use simultaneous unsupervised learning and supervised learning rather than two sequential stages. One may also avoid struggling with these complicated ideas about how optimization in the supervised learning stage preserves information from the unsupervised learning stage by simply freezing the parameters for the feature extractors and using supervised learning only to add a classifier on top of the learned features.

The other idea, that a learning algorithm can use information learned in the unsupervised phase to perform better in the supervised learning stage, is better understood. The basic idea is that some features that are useful for the unsupervised task may also be useful for the supervised learning task. For example, if we train a generative model of images of cars and motorcycles, it will need to know about wheels, and about how many wheels should be in an image. If we are fortunate, the representation of the wheels will take on a form that is easy for the supervised learner to access. This is not yet understood at a mathematical, theoretical level, so it is not always possible to predict which tasks will benefit from unsupervised learning in this way. Many aspects of this approach are highly dependent on the specific models used. For example, if we wish to add a linear classifier on
top of pretrained features, the features must make the underlying classes linearly separable. These properties often occur naturally but do not always do so. This is another reason that simultaneous supervised and unsupervised learning can be preferable: the constraints imposed by the output layer are naturally included from the start.

From the point of view of unsupervised pretraining as learning a representation, we can expect unsupervised pretraining to be more effective when the initial representation is poor. One key example of this is the use of word embeddings. Words represented by one-hot vectors are not very informative, because every two distinct one-hot vectors are the same distance from each other (squared L2 distance of 2). Learned word embeddings naturally encode similarity between words by their distance from each other. Because of this, unsupervised pretraining is especially useful when processing words. It is less useful when processing images, perhaps because images already lie in a rich vector space where distances provide a low quality similarity metric.
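The equidistance of one-hot vectors is easy to verify, and a small embedding shows how learned vectors can instead encode similarity. Note the embedding below is a toy with hand-picked values, not one learned from data:

```python
import numpy as np

# Any two distinct one-hot vectors have squared L2 distance
# 1^2 + 1^2 = 2, regardless of which words they encode.
V = 10                      # vocabulary size (arbitrary)
one_hot = np.eye(V)
d2 = np.sum((one_hot[3] - one_hot[7]) ** 2)
print(d2)                   # 2.0 for every distinct pair

# Learned embeddings, by contrast, can place related words close together.
emb = {"cat": np.array([0.9, 0.1]),
       "dog": np.array([0.8, 0.2]),
       "car": np.array([-0.7, 0.6])}
near = np.sum((emb["cat"] - emb["dog"]) ** 2)   # small: related words
far = np.sum((emb["cat"] - emb["car"]) ** 2)    # large: unrelated words
print(near < far)  # True
```

Because the one-hot geometry carries no similarity information at all, a pretrained embedding gives the supervised learner a strictly more informative starting point.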
From the point of view of unsupervised pretraining as a regularizer, we can expect unsupervised pretraining to be most helpful when the number of labeled examples is very small. Because the source of information added by unsupervised pretraining is the unlabeled data, we may also expect unsupervised pretraining to perform best when the number of unlabeled examples is very large. The advantage of semi-supervised learning via unsupervised pretraining with many unlabeled examples and few labeled examples was made particularly clear in 2011, with unsupervised pretraining winning two international transfer learning competitions (Mesnil et al., 2011; Goodfellow et al., 2011), in settings where the number of labeled examples in the target task was small (from a handful to dozens of examples per class). These effects were also documented in carefully controlled experiments by Paine et al. (2014).

Other factors are likely to be involved. For example, unsupervised pretraining is likely to be most useful when the function to be learned is extremely complicated. Unsupervised learning differs from regularizers like weight decay because it does not bias the learner toward discovering a simple function, but rather toward discovering feature functions that are useful for the unsupervised learning task. If the true underlying functions are complicated and shaped by regularities of the input distribution, unsupervised learning can be a more appropriate regularizer.

These caveats aside, we now analyze some success cases where unsupervised pretraining is known to cause an improvement, and explain what is known about why this improvement occurs. Unsupervised pretraining has usually been used to improve classifiers, and is usually most interesting from the point of view of
CHAPTER 15. REPRESENTATION LEARNING
Figure 15.1: Visualization via nonlinear projection of the learning trajectories of different neural networks in function space (not parameter space, to avoid the issue of many-to-one mappings from parameter vectors to functions), with different random initializations and with or without unsupervised pretraining. Each point corresponds to a different neural network at a particular time during its training process. This figure is adapted with permission from Erhan et al. (2010). A coordinate in function space is an infinite-dimensional vector associating every input x with an output y. Erhan et al. (2010) made a linear projection to high-dimensional space by concatenating the y for many specific x points. They then made a further nonlinear projection to 2-D by Isomap (Tenenbaum et al., 2000). Color indicates time. All networks are initialized near the center of the plot (corresponding to the region of functions that produce approximately uniform distributions over the class y for most inputs). Over time, learning moves the function outward, to points that make strong predictions. Training consistently terminates in one region when using pretraining and in another, non-overlapping region when not using pretraining. Isomap tries to preserve global relative distances (and hence volumes) so the small region corresponding to pretrained models may indicate that the pretraining-based estimator has reduced variance.
reducing test set error. However, unsupervised pretraining can help tasks other than classification, and can act to improve optimization rather than being merely a regularizer. For example, it can improve both train and test reconstruction error for deep autoencoders (Hinton and Salakhutdinov, 2006).

Erhan et al. (2010) performed many experiments to explain several successes of unsupervised pretraining. Both improvements to training error and improvements to test error may be explained in terms of unsupervised pretraining taking the parameters into a region that would otherwise be inaccessible. Neural network training is non-deterministic, and converges to a different function every time it is run. Training may halt at a point where the gradient becomes small, a point where early stopping ends training to prevent overfitting, or at a point where the gradient is large but it is difficult to find a downhill step due to problems such as stochasticity or poor conditioning of the Hessian. Neural networks that receive unsupervised pretraining consistently halt in the same region of function space, while neural networks without pretraining consistently halt in another region. See Fig. 15.1 for a visualization of this phenomenon. The region where pretrained networks arrive is smaller, suggesting that pretraining reduces the variance of the estimation process, which can in turn reduce the risk of severe overfitting. In other words, unsupervised pretraining initializes neural network parameters into a region that they do not escape, and the results following this initialization are more consistent and less likely to be very bad than without this initialization.

Erhan et al. (2010) also provide some answers as to when pretraining works best: the mean and variance of the test error were most reduced by pretraining for deeper networks. Keep in mind that these experiments were performed before the invention and popularization of modern techniques for training very deep networks (rectified linear units, dropout and batch normalization), so less is known about the effect of unsupervised pretraining in conjunction with contemporary approaches.

An important question is how unsupervised pretraining can act as a regularizer. One hypothesis is that pretraining encourages the learning algorithm to discover features that relate to the underlying causes that generate the observed data. This is an important idea motivating many other algorithms besides unsupervised pretraining, and is described further in Sec. 15.3.
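The two-phase procedure under discussion, unsupervised pretraining followed by supervised training, can be sketched in miniature. Everything below is an illustrative assumption (a linear autoencoder, tiny layer sizes, hand-picked step sizes), and for simplicity the second phase trains only a logistic head on the frozen pretrained encoder, whereas full fine-tuning would also update the encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: plentiful unlabeled data, only a few labeled examples.
X_u = rng.normal(size=(200, 10))             # unlabeled set
X_l = X_u[:20]                               # small labeled subset
y_l = (X_l[:, 0] > 0).astype(float)          # binary labels

# Phase 1: unsupervised pretraining of the encoder as a linear autoencoder.
W_enc = rng.normal(scale=0.1, size=(10, 5))
W_dec = rng.normal(scale=0.1, size=(5, 10))

def recon_loss():
    return float(((X_u @ W_enc @ W_dec - X_u) ** 2).mean())

loss_before_pretrain = recon_loss()
for _ in range(300):
    H = X_u @ W_enc
    E = (H @ W_dec - X_u) / len(X_u)         # scaled reconstruction error
    W_enc -= 0.05 * X_u.T @ (E @ W_dec.T)    # descend the reconstruction loss
    W_dec -= 0.05 * H.T @ E
loss_after_pretrain = recon_loss()

# Phase 2: supervised training of a logistic head on the pretrained features.
H_l = X_l @ W_enc                            # features from the frozen encoder
w, b = np.zeros(5), 0.0

def sup_loss():
    p = 1 / (1 + np.exp(-(H_l @ w + b)))
    return float(-np.mean(y_l * np.log(p + 1e-9) + (1 - y_l) * np.log(1 - p + 1e-9)))

loss_before_sup = sup_loss()
for _ in range(300):
    p = 1 / (1 + np.exp(-(H_l @ w + b)))
    w -= 0.1 * H_l.T @ (p - y_l) / len(X_l)  # descend the cross-entropy
    b -= 0.1 * float(np.mean(p - y_l))
loss_after_sup = sup_loss()
```

Each phase has its own objective and its own hyperparameters; in practice the number of phase-1 iterations is often set by early stopping on the reconstruction loss, as noted later in this section.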
Compared to other ways of incorporating this belief by using unsupervised learning, unsupervised pretraining has the disadvantage that it operates with two separate training phases. One reason that these two training phases are disadvantageous is that there is not a single hyperparameter that predictably reduces or increases the strength of the regularization arising from the unsupervised pretraining. Instead, there are very many hyperparameters, whose effect may be
measured after the fact but is often difficult to predict ahead of time. When we perform unsupervised and supervised learning simultaneously, instead of using the pretraining strategy, there is a single hyperparameter, usually a coefficient attached to the unsupervised cost, that determines how strongly the unsupervised objective will regularize the supervised model. One can always predictably obtain less regularization by decreasing this coefficient. In the case of unsupervised pretraining, there is not a way of flexibly adapting the strength of the regularization: either the supervised model is initialized to pretrained parameters, or it is not.

Another disadvantage of having two separate training phases is that each phase has its own hyperparameters. The performance of the second phase usually cannot be predicted during the first phase, so there is a long delay between proposing hyperparameters for the first phase and being able to update them using feedback from the second phase. The most principled approach is to use validation set error in the supervised phase in order to select the hyperparameters of the pretraining phase, as discussed in Larochelle et al. (2009). In practice, some hyperparameters, like the number of pretraining iterations, are more conveniently set during the pretraining phase, using early stopping on the unsupervised objective, which is not ideal but computationally much cheaper than using the supervised objective.

Today, unsupervised pretraining has been largely abandoned, except in the field of natural language processing, where the natural representation of words as one-hot vectors conveys no similarity information and where very large unlabeled sets are available. In that case, the advantage of pretraining is that one can pretrain once on a huge unlabeled set (for example with a corpus containing billions of words), learn a good representation (typically of words, but also of sentences), and then use this representation or fine-tune it for a supervised task for which the training set contains substantially fewer examples. This approach was pioneered by Collobert and Weston (2008b), Turian et al. (2010), and Collobert et al. (2011a) and remains in common use today.

Deep learning techniques based on supervised learning, regularized with dropout or batch normalization, are able to achieve human-level performance on very many tasks, but only with extremely large labeled datasets. These same techniques outperform unsupervised pretraining on medium-sized datasets such as CIFAR-10 and MNIST, which have roughly 5,000 labeled examples per class. On extremely small datasets, such as the alternative splicing dataset, Bayesian methods outperform methods based on unsupervised pretraining (Srivastava, 2013). For these reasons, the popularity of unsupervised pretraining has declined. Nevertheless, unsupervised pretraining remains an important milestone in the history of deep learning research and continues to influence contemporary approaches.
The idea of pretraining has been generalized to supervised pretraining, discussed in Sec. 8.7.4, as a very common approach for transfer learning. Supervised pretraining for transfer learning is popular (Oquab et al., 2014; Yosinski et al., 2014) for use with convolutional networks pretrained on the ImageNet dataset. Practitioners publish the parameters of these trained networks for this purpose, just like pretrained word vectors are published for natural language tasks (Collobert et al., 2011a; Mikolov et al., 2013a).
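Pretrained word vectors are useful precisely because, as noted at the start of this section, distance in the embedding space encodes similarity between words. A toy illustration, with made-up 4-dimensional vectors standing in for real pretrained embeddings (which typically have hundreds of dimensions):

```python
import numpy as np

# Made-up embeddings; real pretrained vectors come from the published
# parameter sets mentioned above.
emb = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.2, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9, 0.5]),
}

def cosine(u, v):
    # Cosine similarity: 1 for identical directions, 0 for orthogonal ones.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine(emb["king"], emb["queen"])
sim_unrelated = cosine(emb["king"], emb["apple"])
# Related words lie closer together: sim_related > sim_unrelated.
```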
15.2  Transfer Learning and Domain Adaptation
Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (i.e., distribution P1) is exploited to improve generalization in another setting (say distribution P2). This generalizes the idea presented in the previous section, where we transferred representations between an unsupervised learning task and a supervised learning task.

In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in P1 are relevant to the variations that need to be captured for learning P2. This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature. For example, we may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting. If there is significantly more data in the first setting (sampled from P1), then that may help to learn representations that are useful to quickly generalize from only very few examples drawn from P2. Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, etc. In general, transfer learning, multi-task learning (Sec. 7.7), and domain adaptation can be achieved via representation learning when there exist features that are useful for the different settings or tasks, corresponding to underlying factors that appear in more than one setting. This is illustrated in Fig. 7.2, with shared lower layers and task-dependent upper layers.

However, sometimes, what is shared among the different tasks is not the semantics of the input but the semantics of the output. For example, a speech recognition system needs to produce valid sentences at the output layer, but the earlier layers near the input may need to recognize very different versions of the same phonemes or sub-phonemic vocalizations depending on which person is speaking. In cases like these, it makes more sense to share the upper layers (near the output) of the neural network, and have a task-specific preprocessing, as
illustrated in Fig. 15.2.
Figure 15.2: Example architecture for multi-task or transfer learning when the output variable y has the same semantics for all tasks while the input variable x has a different meaning (and possibly even a different dimension) for each task (or, for example, each user), called x(1), x(2) and x(3) for three tasks. The lower levels (up to the selection switch) are task-specific, while the upper levels are shared. The lower levels learn to translate their task-specific input into a generic set of features.

In the related case of domain adaptation, the task (and the optimal input-to-output mapping) remains the same between each setting, but the input distribution is slightly different. For example, consider the task of sentiment analysis, which consists of determining whether a comment expresses positive or negative sentiment. Comments posted on the web come from many categories. A domain adaptation scenario can arise when a sentiment predictor trained on customer reviews of media content such as books, videos and music is later used to analyze comments about consumer electronics such as televisions or smartphones. One can imagine that there is an underlying function that tells whether any statement is positive, neutral or negative, but of course the vocabulary and style may vary from one domain to another, making it more difficult to generalize across domains. Simple unsupervised pretraining (with denoising autoencoders) has been found to be very successful for sentiment analysis with domain adaptation (Glorot et al., 2011b).
Simple neutral or negative, but (with of course the vocabulary and has style mayfound varytofrom one unsup unsupervised ervised pretraining denoising auto autoenco enco encoders) ders) b een b e very domain to for another, making it more difficult toadaptation generalize across Simple successful sentimen sentiment t analysis with domain (Glorotdomains. et al., 2011b ). unsup ervised pretraining (with denoising auto enco ders) has b een found to b e very A related problem is that of conc oncept ept drift drift,, whic which h we can view as a form of transfer successful for sentiment analysis with domain adaptation (Glorot et al., 2011b). learning due to gradual changes in the data distribution ov over er time. Both concept related problem is that of bconc ept drift which we can view a form oflearning. transfer driftAand transfer learning can e viewed as, particular forms of as multi-task learning due to gradual changes in the data distribution over time. Both concept drift and transfer learning can b e viewed540 as particular forms of multi-task learning.
While the phrase “multi-task learning” typically refers to supervised learning tasks, the more general notion of transfer learning is applicable to unsupervised learning and reinforcement learning as well.

In all of these cases, the objective is to take advantage of data from the first setting to extract information that may be useful when learning or even when directly making predictions in the second setting. The core idea of representation learning is that the same representation may be useful in both settings. Using the same representation in both settings allows the representation to benefit from the training data that is available for both tasks.

As mentioned before, unsupervised deep learning for transfer learning has found success in some machine learning competitions (Mesnil et al., 2011; Goodfellow et al., 2011). In the first of these competitions, the experimental setup is the following. Each participant is first given a dataset from the first setting (from distribution P1), illustrating examples of some set of categories. The participants must use this to learn a good feature space (mapping the raw input to some representation), such that when we apply this learned transformation to inputs from the transfer setting (distribution P2), a linear classifier can be trained and generalize well from very few labeled examples. One of the most striking results found in this competition is that as an architecture makes use of deeper and deeper representations (learned in a purely unsupervised way from data collected in the first setting, P1), the learning curve on the new categories of the second (transfer) setting P2 becomes much better. For deep representations, fewer labeled examples of the transfer tasks are necessary to achieve the apparently asymptotic generalization performance.
Two extreme forms of transfer learning are one-shot learning and zero-shot learning, sometimes also called zero-data learning. Only one labeled example of the transfer task is given for one-shot learning, while no labeled examples are given at all for the zero-shot learning task.

One-shot learning (Fei-Fei et al., 2006) is possible because the representation learns to cleanly separate the underlying classes during the first stage. During the transfer learning stage, only one labeled example is needed to infer the label of many possible test examples that all cluster around the same point in representation space. This works to the extent that the factors of variation corresponding to these invariances have been cleanly separated from the other factors, in the learned
representation space, and we have somehow learned which factors do and do not matter when discriminating objects of certain categories.

As an example of a zero-shot learning setting, consider the problem of having a learner read a large collection of text and then solve object recognition problems.
CHAPTER 15. REPRESENTATION LEARNING
It may be possible to recognize a specific object class even without having seen an image of that object, if the text describes the object well enough. For example, having read that a cat has four legs and pointy ears, the learner might be able to guess that an image is a cat, without having seen a cat before.

Zero-data learning (Larochelle et al., 2008) and zero-shot learning (Palatucci et al., 2009; Socher et al., 2013b) are only possible because additional information has been exploited during training. We can think of the zero-data learning scenario as including three random variables: the traditional inputs x, the traditional outputs or targets y, and an additional random variable describing the task, T. The model is trained to estimate the conditional distribution p(y | x, T), where T is a description of the task we wish the model to perform. In our example of recognizing cats after having read about cats, the output is a binary variable y,
with y = 1 indicating “yes” and y = 0 indicating “no.” The task variable T then represents questions to be answered, such as “Is there a cat in this image?” If we have a training set containing unsupervised examples of objects that live in the same space as T, we may be able to infer the meaning of unseen instances of T. In our example of recognizing cats without having seen an image of the cat, it is important that we have had unlabeled text data containing sentences such as “cats have four legs” or “cats have pointy ears.”

Zero-shot learning requires T to be represented in a way that allows some sort of generalization. For example, T cannot be just a one-hot code indicating an object category. Socher et al. (2013b) provide instead a distributed representation of object categories by using a learned word embedding for the word associated with each category.
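A minimal sketch of why a distributed task representation enables zero-shot generalization: describe each category by a vector of attributes rather than a one-hot code, and score an input's inferred attributes against every category's vector; a category with no training images can still be selected. All attribute values and the hardcoded "predictor" below are invented for illustration:

```python
import numpy as np

# Hypothetical class descriptions T as attribute vectors
# (four_legs, pointy_ears, has_wings): a distributed code, not one-hot.
task_desc = {
    "cat":  np.array([1.0, 1.0, 0.0]),
    "dog":  np.array([1.0, 0.0, 0.0]),
    "bird": np.array([0.0, 0.0, 1.0]),  # no training images of birds
}

def predict_attributes(image):
    # A real system would infer attributes from pixels; here we
    # hardcode a plausible output for a picture of a bird.
    return np.array([0.1, 0.0, 0.9])

def zero_shot_classify(attrs, task_desc):
    """Pick the class whose description best matches the attributes."""
    return max(task_desc, key=lambda c: float(attrs @ task_desc[c]))

attrs = predict_attributes(None)
print(zero_shot_classify(attrs, task_desc))  # -> bird
```

With a one-hot code for T, the "bird" row would share no components with anything learned during training, and no such generalization would be possible.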
A similar phenomenon happens in machine translation (Klementiev et al., 2012; Mikolov et al., 2013b; Gouws et al., 2014): we have words in one language, and the relationships between words can be learned from unilingual corpora; on the other hand, we have translated sentences which relate words in one language with words in the other. Even though we may not have labeled examples translating word A in language X to word B in language Y, we can generalize and guess a translation for word A because we have learned a distributed representation for words in language X, a distributed representation for words in language Y, and created a link (possibly two-way) relating the two spaces, via training examples consisting of matched pairs of sentences in both languages.
This transfer will be most successful if all three ingredients (the two representations and the relations between them) are learned jointly.

Zero-shot learning is a particular form of transfer learning. The same principle explains how one can perform multi-modal learning, capturing a representation in
Figure 15.3: Transfer learning between two domains x and y enables zero-shot learning. Labeled or unlabeled examples of x allow one to learn a representation function f_x, and similarly with examples of y to learn f_y. Each application of the f_x and f_y functions appears as an upward arrow, with the style of the arrows indicating which function is applied. Distance in h_x space provides a similarity metric between any pair of points in x space that may be more meaningful than distance in x space. Likewise, distance in h_y space provides a similarity metric between any pair of points in y space. Both of these similarity functions are indicated with dotted bidirectional arrows. Labeled examples (dashed horizontal lines) are pairs (x, y) which allow one to learn a one-way or two-way map (solid bidirectional arrow) between the representations f_x(x) and the representations f_y(y) and anchor these representations to each other. Zero-data learning is then enabled as follows. One can associate an image x_test to a word y_test, even if no image of that word was ever presented, simply because word-representations f_y(y_test) and image-representations f_x(x_test) can be related to each other via the maps between representation spaces. It works because, although that image and that word were never paired, their respective feature vectors f_x(x_test) and f_y(y_test) have been related to each other. Figure inspired from a suggestion by Hrant Khachatrian.
one modality, a representation in the other, and the relationship (in general a joint distribution) between pairs (x, y) consisting of one observation x in one modality and another observation y in the other modality (Srivastava and Salakhutdinov, 2012). By learning all three sets of parameters (from x to its representation, from y to its representation, and the relationship between the two representations), concepts in one representation are anchored in the other, and vice-versa, allowing one to meaningfully generalize to new pairs. The procedure is illustrated in Fig. 15.3.
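The linking of two representation spaces through matched pairs can be sketched as a least-squares map between embedding spaces; an unseen point in one space is then associated with its nearest neighbor in the other. The toy vocabulary and random embeddings below are invented; real systems would use learned word or image representations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical embeddings in two spaces related by an unknown rotation.
d = 5
words = ["cat", "dog", "house", "tree", "car", "boat"]
true_map = np.linalg.qr(rng.normal(size=(d, d)))[0]   # hidden relation
hx = {w: rng.normal(size=d) for w in words}           # x-space embeddings
hy = {w: hx[w] @ true_map for w in words}             # y-space embeddings

# Learn a linear map W from a few matched pairs (the "translated
# sentences"); "boat" is held out entirely.
pairs = words[:5]
X = np.stack([hx[w] for w in pairs])
Y = np.stack([hy[w] for w in pairs])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Zero-shot association: map the held-out x-embedding into y-space and
# find its nearest neighbor among all y-embeddings.
z = hx["boat"] @ W
nearest = min(hy, key=lambda w: float(np.linalg.norm(z - hy[w])))
print(nearest)  # -> boat
```

The pairs anchor the two spaces; once W is learned, points never seen in a pair can still be matched across spaces.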
15.3  Semi-Supervised Disentangling of Causal Factors
An important question about representation learning is “what makes one representation better than another?” One hypothesis is that an ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in feature space corresponding to different causes, so that the representation disentangles the causes from one another. This hypothesis motivates approaches in which we first seek a good representation for p(x). Such a representation may also be a good representation for computing p(y | x) if y is among the most salient causes of x. This idea has guided a large amount of deep learning research since at least the 1990s; see Becker and Hinton (1992) and Hinton and Sejnowski (1999) for more detail. For other arguments about when semi-supervised learning can outperform pure
supervised learning, we refer the reader to Sec. 1.2 of Chapelle et al. (2006).

In other approaches to representation learning, we have often been concerned with a representation that is easy to model, for example one whose entries are sparse, or independent from each other. A representation that cleanly separates the underlying causal factors may not necessarily be one that is easy to model. However, a further part of the hypothesis motivating semi-supervised learning via unsupervised representation learning is that for many AI tasks, these two properties coincide: once we are able to obtain the underlying explanations for what we observe, it generally becomes easy to isolate individual attributes from the others. Specifically, if a representation h represents many of the underlying causes of the observed x, and the outputs y are among the most salient causes, then it is easy to predict y from h.
causes of the observed x , and the outputs y are among the most salient causes, First, let us see ho how w semi-sup semi-supervised ervised learning can fail b ecause unsup unsupervised ervised then it is easy to predict y from h. learning of p(x) is of no help to learn p( y | x). Consider for example the case First, see howdistributed semi-sup ervised b)ecause ervised, | x]. Clearly p (x )let f (x = E[yunsup where is us uniformly and welearning wan antt to can learnfail Clearly, learning of ) is of no help to learn ) . Consider for example the y x p ( x p ( observing a training set of x values alone giv gives es us no informationEab about out p(y case | x). where p (x ) is uniformly distributed and we w|ant to learn f ( x) = [y x]. Clearly, 544 gives us no information ab|out p(y x). observing a training set of x values alone |
Figure 15.4: Example of a density over x that is a mixture over three components. The component identity is an underlying explanatory factor, y. Because the mixture components (e.g., natural object classes in image data) are statistically salient, just modeling p(x) in an unsupervised way with no labeled example already reveals the factor y.
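The scenario of Fig. 15.4 can be simulated directly: cluster unlabeled draws from a well-separated mixture, then attach a single labeled example to each cluster. The sketch below uses plain 1-D k-means on made-up data; any density model that recovers the components would serve:

```python
import random
import statistics

random.seed(0)

# Unlabeled data: a well-separated three-component mixture over x,
# with one component per value of y (made-up means and spread).
means = {1: -5.0, 2: 0.0, 3: 5.0}
xs = [random.gauss(m, 0.5) for m in means.values() for _ in range(200)]

# Unsupervised step: model the components with 1-D k-means
# (Lloyd's algorithm), initialized at the data's extremes and mean.
centers = [min(xs), statistics.mean(xs), max(xs)]
for _ in range(20):
    clusters = [[] for _ in centers]
    for x in xs:
        i = min(range(3), key=lambda i: abs(x - centers[i]))
        clusters[i].append(x)
    centers = [statistics.mean(c) for c in clusters]

# Semi-supervised step: ONE labeled example per class names the
# clusters and thereby defines p(y | x).
labeled = [(-5.0, 1), (0.0, 2), (5.0, 3)]
name = {min(range(3), key=lambda i: abs(x - centers[i])): y
        for x, y in labeled}

def predict(x):
    return name[min(range(3), key=lambda i: abs(x - centers[i]))]

print(predict(4.7), predict(-4.9), predict(0.3))  # -> 3 1 2
```

Almost all of the work is done by the unsupervised stage; the three labels only name the components it already found.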
Next, let us see a simple example of how semi-supervised learning can succeed. Consider the situation where x arises from a mixture, with one mixture component per value of y, as illustrated in Fig. 15.4. If the mixture components are well-separated, then modeling p(x) reveals precisely where each component is, and a single labeled example of each class will then be enough to perfectly learn p(y | x).

But more generally, what could make p(y | x) and p(x) be tied together? If y is closely associated with one of the causal factors of x, then p(x) and p(y | x) will be strongly tied, and unsupervised representation learning that tries to disentangle the underlying factors of variation is likely to be useful as a semi-supervised learning strategy.

Consider the assumption that y is one of the causal factors of x, and let h represent all those factors. The true generative process can be conceived as structured according to this directed graphical model, with h as the parent of x:

p(h, x) = p(x | h)p(h).     (15.1)

As a consequence, the data has marginal probability

p(x) = E_h p(x | h).     (15.2)

From this straightforward observation, we conclude that the best possible model of x (from a generalization point of view) is the one that uncovers the above “true”
structure, with h as a latent variable that explains the observed variations in x. The “ideal” representation learning discussed above should thus recover these latent factors. If y is one of these (or closely related to one of them), then it will be very easy to learn to predict y from such a representation. We also see that the conditional distribution of y given x is tied by Bayes’ rule to the components in the above equation:

p(y | x) = p(x | y)p(y) / p(x).     (15.3)

Thus the marginal p(x) is intimately tied to the conditional p(y | x), and knowledge of the structure of the former should be helpful to learn the latter. Therefore, in situations respecting these assumptions, semi-supervised learning should improve performance.

An important research problem regards the fact that most observations are formed by an extremely large number of underlying causes. Suppose y = h_i, but the unsupervised learner does not know which h_i. The brute force solution is for an unsupervised learner to learn a representation that captures all the reasonably salient generative factors h_j and disentangles them from each other, thus making it easy to predict y from h, regardless of which h_i is associated with y.

In practice, the brute force solution is not feasible because it is not possible to capture all or most of the factors of variation that influence an observation. For example, in a visual scene, should the representation always encode all of the smallest objects in the background? It is a well-documented psychological phenomenon that human beings fail to perceive changes in their environment that are not immediately relevant to the task they are performing; see, e.g., Simons and Levin (1998). An important research frontier in semi-supervised learning is determining what to encode in each situation. Currently, two of the main strategies for dealing with a large number of underlying causes are to use a supervised learning signal at the same time as the unsupervised learning signal, so that the model will choose to capture the most relevant factors of variation, or to use much larger representations if using purely unsupervised learning.

An emerging strategy for unsupervised learning is to modify the definition of which underlying causes are most salient. Historically, autoencoders and generative models have been trained to optimize a fixed criterion, often similar to mean squared error. These fixed criteria determine which causes are considered salient. For example, mean squared error applied to the pixels of an image implicitly specifies that an underlying cause is only salient if it significantly changes the brightness of a large number of pixels. This can be problematic if the task we wish to solve involves interacting with small objects. See Fig. 15.5 for an example
Figure 15.5: An autoencoder trained with mean squared error for a robotics task has failed to reconstruct a ping pong ball. The existence of the ping pong ball and all of its spatial coordinates are important underlying causal factors that generate the image and are relevant to the robotics task. Unfortunately, the autoencoder has limited capacity, and the training with mean squared error did not identify the ping pong ball as being salient enough to encode. Images graciously provided by Chelsea Finn.
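The failure shown in Fig. 15.5 follows directly from the loss: a small object occupies few pixels, so omitting it moves the mean squared error less than a mild global brightness change does. A toy numeric illustration with made-up "images" as arrays:

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# A 32x32 "scene": uniform background plus a small bright 2x2 ball.
scene = np.full((32, 32), 0.5)
scene[10:12, 10:12] = 1.0

# Reconstruction A: perfect, except the ball is missing entirely.
missing_ball = np.full((32, 32), 0.5)

# Reconstruction B: the ball is kept, but the whole image is dimmed
# by 0.1, a visually mild and task-irrelevant change.
dimmed = scene - 0.1

print(mse(scene, missing_ball))  # 4 pixels off by 0.5 -> 1/1024, ~0.001
print(mse(scene, dimmed))        # every pixel off by 0.1 -> 0.01
# MSE penalizes dropping the task-critical ball far less than mild
# global dimming, so an MSE-trained model may simply omit the ball.
```

Under this criterion, a capacity-limited model minimizing MSE will spend its capacity on the background rather than the ball.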
of a robotics task in which an autoencoder has failed to learn to encode a small ping pong ball. This same robot is capable of successfully interacting with larger objects, such as baseballs, which are more salient according to mean squared error.

Other definitions of salience are possible. For example, if a group of pixels follow a highly recognizable pattern, even if that pattern does not involve extreme brightness or darkness, then that pattern could be considered extremely salient. One way to implement such a definition of salience is to use a recently developed approach called generative adversarial networks (Goodfellow et al., 2014c). In this approach, a generative model is trained to fool a feedforward classifier. The feedforward classifier attempts to recognize all samples from the generative model as being fake, and all samples from the training set as being real. In this framework, any structured pattern that the feedforward network can recognize is highly salient. The generative adversarial network will be described in more detail in Sec. 20.10.4. For the purposes of the present discussion, it is sufficient to understand that such networks learn how to determine what is salient. Lotter et al. (2015) showed that models trained to generate images of human heads will often neglect to generate the ears when trained with mean squared error, but will successfully generate the ears when trained with the adversarial framework. Because the ears are not extremely bright or dark compared to the surrounding skin, they are not especially salient according to mean squared error loss, but their highly recognizable shape and consistent
CHAPTER 15. REPRESENTATION LEARNING
[Figure 15.6: two rows of three image panels each — Ground Truth, MSE, Adversarial]
Figure 15.6: Predictive generative networks provide an example of the importance of learning which features are salient. In this example, the predictive generative network has been trained to predict the appearance of a 3-D model of a human head at a specific viewing angle. (Left) Ground truth. This is the correct image, that the network should emit. (Center) Image produced by a predictive generative network trained with mean squared error alone. Because the ears do not cause an extreme difference in brightness compared to the neighboring skin, they were not sufficiently salient for the model to learn to represent them. (Right) Image produced by a model trained with a combination of mean squared error and adversarial loss. Using this learned cost function, the ears are salient because they follow a predictable pattern. Learning which underlying causes are important and relevant enough to model is an important active area of research. Figures graciously provided by Lotter et al. (2015).
position means that a feedforward network can easily learn to detect them, making them highly salient under the generative adversarial framework. See Fig. 15.6 for example images. Generative adversarial networks are only one step toward determining which factors should be represented. We expect that future research will discover better ways of determining which factors to represent, and develop mechanisms for representing different factors depending on the task.

A benefit of learning the underlying causal factors, as pointed out by Schölkopf et al. (2012), is that if the true generative process has x as an effect and y as a cause, then modeling p(x | y) is robust to changes in p(y). If the cause-effect relationship were reversed, this would not be true, since by Bayes' rule, p(x | y) would be sensitive to changes in p(y). Very often, when we consider changes in distribution due to different domains, temporal non-stationarity, or changes in the nature of the task, the causal mechanisms remain invariant ("the laws of the universe are constant") while the marginal distribution over the underlying causes can change. Hence, better generalization and robustness to all kinds of changes can be expected via learning a generative model that attempts to recover
the causal factors h and p(x | h).

15.4 Distributed Representation

Distributed representations of concepts (representations composed of many elements that can be set separately from each other) are one of the most important tools for representation learning. Distributed representations are powerful because they can use n features with k values to describe k^n different concepts. As we have seen throughout this book, both neural networks with multiple hidden units and probabilistic models with multiple latent variables make use of the strategy of distributed representation. We now introduce an additional observation. Many deep learning algorithms are motivated by the assumption that the hidden units can learn to represent the underlying causal factors that explain the data, as discussed in Sec. 15.3. Distributed representations are natural for this approach, because each direction in representation space can correspond to the value of a different underlying configuration variable.

An example of a distributed representation is a vector of n binary features, which can take 2^n configurations, each potentially corresponding to a different region in input space, as illustrated in Fig. 15.7. This can be compared with a symbolic representation, where the input is associated with a single symbol or category. If there are n symbols in the dictionary, one can imagine n feature detectors, each corresponding to the detection of the presence of the associated category. In that case only n different configurations of the representation space are possible, carving n different regions in input space, as illustrated in Fig. 15.8. Such a symbolic representation is also called a one-hot representation, since it can be captured by a binary vector with n bits that are mutually exclusive (only one of them can be active). A symbolic representation is a specific example of the broader class of non-distributed representations, which are representations that may contain many entries but without significant meaningful separate control over each entry.

Examples of learning algorithms based on non-distributed representations include:

• Clustering methods, including the k-means algorithm: each input point is assigned to exactly one cluster.

• k-nearest neighbors algorithms: one or a few templates or prototype examples are associated with a given input. In the case of k > 1, there are multiple
[Figure 15.7: three lines h1, h2, h3 partition the plane; regions labeled with the codes h = [1,0,0]^T, [1,1,0]^T, [1,0,1]^T, [1,1,1]^T, [0,1,0]^T, [0,1,1]^T, and [0,0,1]^T]
Figure 15.7: Illustration of how a learning algorithm based on a distributed representation breaks up the input space into regions. In this example, there are three binary features h1, h2, and h3. Each feature is defined by thresholding the output of a learned, linear transformation. Each feature divides R^2 into two half-planes. Let h_i^+ be the set of input points for which h_i = 1 and h_i^- be the set of input points for which h_i = 0. In this illustration, each line represents the decision boundary for one h_i, with the corresponding arrow pointing to the h_i^+ side of the boundary. The representation as a whole takes on a unique value at each possible intersection of these half-planes. For example, the representation value [1, 1, 1]^T corresponds to the region h_1^+ ∩ h_2^+ ∩ h_3^+. Compare this to the non-distributed representations in Fig. 15.8. In the general case of d input dimensions, a distributed representation divides R^d by intersecting half-spaces rather than half-planes. The distributed representation with n features assigns unique codes to O(n^d) different regions, while the nearest neighbor algorithm with n examples assigns unique codes to only n regions. The distributed representation is thus able to distinguish exponentially many more regions than the non-distributed one. Keep in mind that not all h values are feasible (there is no h = 0 in this example) and that a linear classifier on top of the distributed representation is not able to assign different class identities to every neighboring region; even a deep linear-threshold network has a VC dimension of only O(w log w), where w is the number of weights (Sontag, 1998). The combination of powerful representation layer and weak classifier layer can be a strong regularizer; a classifier trying to learn the concept of "person" versus "not a person" does not need to assign a different class to an input represented as "woman with glasses" than it assigns to an input represented as "man without glasses." This capacity constraint encourages each classifier to focus on few h_i and encourages h to learn to represent the classes in a linearly separable way.
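The construction in Fig. 15.7 can be reproduced in a few lines. The sketch below thresholds three linear functions over a dense grid and collects the binary codes that actually occur; the three lines are an arbitrary illustrative choice in general position, not the ones in the figure. Exactly 7 of the 2^3 = 8 codes appear, with the all-zeros code infeasible:

```python
import numpy as np
from itertools import product

# Three half-plane features h_i(x) = 1 if w_i . x + b_i > 0 (illustrative choice).
W = np.array([[1.0, 0.0],     # h1 = 1 iff x > 0
              [0.0, 1.0],     # h2 = 1 iff y > 0
              [-1.0, -1.0]])  # h3 = 1 iff x + y < 1
b = np.array([0.0, 0.0, 1.0])

grid = np.linspace(-3.05, 3.05, 100)          # offsets keep points off the lines
points = np.array(list(product(grid, grid)))
codes = {tuple(c) for c in (points @ W.T + b > 0).astype(int)}

print(len(codes))            # 7 distinct codes, not 2**3 = 8
print((0, 0, 0) in codes)    # False: x < 0, y < 0 and x + y > 1 cannot all hold
```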
values describing each input, but they cannot be controlled separately from each other, so this does not qualify as a true distributed representation.

• Decision trees: only one leaf (and the nodes on the path from root to leaf) is activated when an input is given.

• Gaussian mixtures and mixtures of experts: the templates (cluster centers) or experts are now associated with a degree of activation. As with the k-nearest neighbors algorithm, each input is represented with multiple values, but those values cannot readily be controlled separately from each other.

• Kernel machines with a Gaussian kernel (or other similarly local kernel): although the degree of activation of each "support vector" or template example is now continuous-valued, the same issue arises as with Gaussian mixtures.

• Language or translation models based on n-grams: the set of contexts (sequences of symbols) is partitioned according to a tree structure of suffixes. A leaf may correspond to the last two words being w1 and w2, for example. Separate parameters are estimated for each leaf of the tree (with some sharing being possible).

For some of these non-distributed algorithms, the output is not constant by parts but instead interpolates between neighboring regions. The relationship between the number of parameters (or examples) and the number of regions they can define remains linear.

An important related concept that distinguishes a distributed representation from a symbolic one is that generalization arises due to shared attributes between different concepts. As pure symbols, "cat" and "dog" are as far from each other as any other two symbols. However, if one associates them with a meaningful distributed representation, then many of the things that can be said about cats can generalize to dogs and vice versa.
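This contrast between symbolic and distributed codes can be made concrete with toy vectors. In the sketch below every attribute value is invented for illustration; the point is only that one-hot codes make every pair of distinct symbols equidistant, while a distributed code over shared attributes places "cat" near "dog" and far from "car":

```python
import numpy as np

# One-hot (symbolic) codes: all pairs of distinct symbols are equally far apart.
one_hot = {"cat": np.array([1.0, 0.0, 0.0]),
           "dog": np.array([0.0, 1.0, 0.0]),
           "car": np.array([0.0, 0.0, 1.0])}

# Toy distributed codes over hypothetical attributes
# [has_fur, legs / 4, is_vehicle, size]; the values are invented.
dist = {"cat": np.array([1.0, 1.0, 0.0, 0.2]),
        "dog": np.array([1.0, 1.0, 0.0, 0.6]),
        "car": np.array([0.0, 0.0, 1.0, 0.8])}

def d(rep, a, b):
    return float(np.linalg.norm(rep[a] - rep[b]))

print(d(one_hot, "cat", "dog"), d(one_hot, "cat", "car"))  # identical distances
print(d(dist, "cat", "dog"), d(dist, "cat", "car"))        # cat is far closer to dog
```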
For example, our distributed representation may contain entries such as "has_fur" or "number_of_legs" that have the same value for the embedding of both "cat" and "dog." Neural language models that operate on distributed representations of words generalize much better than other models that operate directly on one-hot representations of words, as discussed in Sec. 12.4. Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance, a property that is absent from purely symbolic representations.

When and why can there be a statistical advantage from using a distributed representation as part of a learning algorithm? Distributed representations can
Figure 15.8: Illustration of how the nearest neighbor algorithm breaks up the input space into different regions. The nearest neighbor algorithm provides an example of a learning algorithm based on a non-distributed representation. Different non-distributed algorithms may have different geometry, but they typically break the input space into regions, with a separate set of parameters for each region. The advantage of a non-distributed approach is that, given enough parameters, it can fit the training set without solving a difficult optimization problem, because it is straightforward to choose a different output independently for each region. The disadvantage is that such non-distributed models generalize only locally via the smoothness prior, making it difficult to learn a complicated function with more peaks and troughs than the available number of examples. Contrast this with a distributed representation, Fig. 15.7.
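The partition in Fig. 15.8 is equally simple to reproduce. In the sketch below the four prototype locations are arbitrary; every query point in a grid is assigned the index of its nearest stored example, so the representation can distinguish at most n regions for n examples:

```python
import numpy as np
from itertools import product

prototypes = np.array([[0.0, 0.0], [2.0, 1.0],
                       [-1.0, 2.0], [1.0, -2.0]])   # n = 4 stored examples

grid = np.linspace(-3, 3, 61)
points = np.array(list(product(grid, grid)))

# Non-distributed code: each input activates exactly one symbol,
# the index of its nearest prototype.
sq_dists = ((points[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
labels = sq_dists.argmin(axis=1)

print(len(set(labels.tolist())))   # 4: one region per stored example, never more
```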
have a statistical advantage when an apparently complicated structure can be compactly represented using a small number of parameters. Some traditional non-distributed learning algorithms generalize only due to the smoothness assumption, which states that if u ≈ v, then the target function f to be learned has the property that f(u) ≈ f(v), in general. There are many ways of formalizing such an assumption, but the end result is that if we have an example (x, y) for which we know that f(x) ≈ y, then we choose an estimator f̂ that approximately satisfies these constraints while changing as little as possible when we move to a nearby input x + ε. This assumption is clearly very useful, but it suffers from the curse of dimensionality: in order to learn a target function that increases and decreases many times in many different regions,¹ we may need a number of examples that is at least as large as the number of distinguishable regions. One can think of each of these regions as a category or symbol: by having a separate degree of freedom for each symbol (or region), we can learn an arbitrary decoder mapping from symbol to value. However, this does not allow us to generalize to new symbols for new regions.

If we are lucky, there may be some regularity in the target function, besides being smooth. For example, a convolutional network with max-pooling can recognize an object regardless of its location in the image, even though spatial translation of the object may not correspond to smooth transformations in the input space.

Let us examine a special case of a distributed representation learning algorithm, one that extracts binary features by thresholding linear functions of the input. Each binary feature in this representation divides R^d into a pair of half-spaces, as illustrated in Fig. 15.7. The exponentially large number of intersections of n of the corresponding half-spaces determines how many regions this distributed representation learner can distinguish. How many regions are generated by an arrangement of n hyperplanes in R^d? By applying a general result concerning the intersection of hyperplanes (Zaslavsky, 1975), one can show (Pascanu et al., 2014b) that the number of regions this binary feature representation can distinguish is

    \sum_{j=0}^{d} \binom{n}{j} = O(n^d).    (15.4)

Therefore, we see a growth that is exponential in the input size and polynomial in the number of hidden units.

¹ Potentially, we may want to learn a function whose behavior is distinct in exponentially many regions: in a d-dimensional space with at least 2 different values to distinguish per dimension, we might want f to differ in 2^d different regions, requiring O(2^d) training examples.
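The count in Eq. 15.4 is straightforward to evaluate; the minimal sketch below uses a function name of our own choosing. Note that when n ≤ d every one of the 2^n binary codes is feasible, while for fixed d the count grows only polynomially in n:

```python
from math import comb

def max_regions(n, d):
    """Maximum number of regions that n hyperplanes in general position
    carve out of R^d (Zaslavsky, 1975): the sum over j = 0..d of C(n, j)."""
    return sum(comb(n, j) for j in range(d + 1))

print(max_regions(3, 2))    # 7: three lines in the plane, as in Fig. 15.7
print(max_regions(3, 5))    # 8 = 2**3: with n <= d, all binary codes are feasible
print(max_regions(100, 2))  # 5051: quadratic growth in n for fixed d = 2
```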
This pro provides vides a geometric argument to explain the generalization pow power er of distributed representation: with O (nd) parameters (for n linear-threshold features a geometric argument to explain thespace. generalization powmade er of d ) wepro in RThis canvides distinctly represent representO in input If instead we O (nd) regions O ( nd n distributed representation: parameters (for linear-threshold no about outwith the data,)and used a representation with onefeatures unique Rassumption at all ab in ) we can distinctly represent ) regions in input space. If instead we made O ( n sym symb b ol for each region, and separate parameters for each symbol to recognize its no assumption all abofout data, and used with one unique d O (anrepresentation O (nd ) R dthe corresp corresponding onding pat ortion , then sp specifying ecifying ) regions would require symb ol forMore each region, and separate parameters forthe each symbol torepresen recognize its examples. generally generally, ,Rthe argument in fa fav vor of distributed representation tation O (nwe) correspbonding p ortion of case , then sp ecifying ) regions require could e extended to the where insteadOof(nusing linearwould threshold units examples. More generally , the argument in fa v or of the distributed represen tation use nonlinear, p ossibly contin continuous, uous, feature extractors for each of the attributes in could b e extended to the case where instead of linear threshold units we the distributed represen representation. tation. The argument in using this case is that if a parametric use nonlinear, pwith ossibly continuous, can feature for eachinofinput the attributes in k parameters transformation learnextractors ab about out r regions space, with the represensuc tation. 
The argument this case is task that of if in a terest, parametric k distributed r, and if obtaining such h a representation wasinuseful to the interest, then k r transformation with parameters can learn ab out regions in input space, with we could p otentially generalize muc uch h b etter in this wa way y than in a non-distributed k r, and if obtaining sucneed h a representation wastouseful to the of in terest, then O (r) examples setting where we would obtain thetask same features and w e could p otentially generalize m uc h b etter in this wa y than in a non-distributed asso associated ciated partitioning of the input space into r regions. Using few fewer er parameters to O ( r setting where we w ould need ) examples to obtain the same and represen representt the mo model del means that we hav havee few fewer er parameters to fit, andfeatures thus require asso ciated partitioning of the input space into regions. Using few er parameters to r far fewer training examples to generalize well. represent the mo del means that we have fewer parameters to fit, and thus require A further part of the argument for why mo models dels based on distributed represenfar fewer training examples to generalize well. tations generalize well is that their capacity remains limited despite being able to A further part of man the yargument why mo dels based on represendistinctly enco encode de so many different for regions. For example, thedistributed VC dimension of a tations generalize well is that their capacity remains limited despite being able to neural netw network ork of linear threshold units is only O(w log w ), where w is the num number ber distinctly de man).y This different regions. 
For example, the VC dimension of a neural network of linear threshold units is only O(w log w), where w is the number of weights (Sontag, 1998). This limitation arises because, while we can assign very many unique codes to representation space, we cannot use absolutely all of the code space, nor can we learn arbitrary functions mapping from the representation space h to the output y using a linear classifier. The use of a distributed representation combined with a linear classifier thus expresses a prior belief that the classes to be recognized are linearly separable as a function of the underlying causal factors captured by h. We will typically want to learn categories such as the set of all images of all green objects or the set of all images of cars, but not categories that require nonlinear, XOR logic. For example, we typically do not want to partition
the data into the set of all red cars and green trucks as one class and the set of all green cars and red trucks as another class.

The ideas discussed so far have been abstract, but they may be experimentally validated. Zhou et al. (2015) find that hidden units in a deep convolutional network trained on the ImageNet and Places benchmark datasets learn features that are very often interpretable, corresponding to a label that humans would naturally assign. In practice it is certainly not always the case that hidden units learn something that has a simple linguistic name, but it is interesting to see this emerge near the top levels of the best computer vision deep networks. What such
CHAPTER 15. REPRESENTATION LEARNING
Figure 15.9: A generative model has learned a distributed representation that disentangles the concept of gender from the concept of wearing glasses. If we begin with the representation of the concept of a man with glasses, then subtract the vector representing the concept of a man without glasses, and finally add the vector representing the concept of a woman without glasses, we obtain the vector representing the concept of a woman with glasses. The generative model correctly decodes all of these representation vectors to images that may be recognized as belonging to the correct class. Images reproduced with permission from Radford et al. (2015).
features have in common is that one could imagine learning about each of them without having to see all the configurations of all the others. Radford et al. (2015) demonstrated that a generative model can learn a representation of images of faces, with separate directions in representation space capturing different underlying factors of variation. Fig. 15.9 demonstrates that one direction in representation space corresponds to whether the person is male or female, while another corresponds to whether the person is wearing glasses. These features were discovered automatically, not fixed a priori. There is no need to have labels for the hidden unit classifiers: gradient descent on an objective function of interest naturally learns semantically interesting features, so long as the task requires such features.
We can learn about the distinction between male and female, or about the presence or absence of glasses, without having to characterize all of the configurations of the n − 1 other features by examples covering all of these combinations of values. This form of statistical separability is what allows one to generalize to new configurations of a person's features that have never been seen during training.
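The arithmetic of Fig. 15.9 can be written out on toy vectors. The codes below are invented purely for illustration; a trained model such as Radford et al.'s learns such directions from data. Here one coordinate stands for gender, another for glasses, and the rest hold unrelated factors.

```python
import numpy as np

# Invented 4-D codes: coordinate 0 ~ gender, coordinate 1 ~ glasses,
# coordinates 2-3 ~ unrelated factors (nothing here comes from a real model).
man_with_glasses      = np.array([ 1.0, 1.0, 0.3, -0.2])
man_without_glasses   = np.array([ 1.0, 0.0, 0.3, -0.2])
woman_without_glasses = np.array([-1.0, 0.0, 0.3, -0.2])

result = man_with_glasses - man_without_glasses + woman_without_glasses

# Subtracting removes the "man" direction, adding restores "woman";
# the "glasses" coordinate and the unrelated factors pass through untouched.
woman_with_glasses = np.array([-1.0, 1.0, 0.3, -0.2])
print(np.allclose(result, woman_with_glasses))  # True
```

Because each factor occupies its own direction, editing one factor leaves the others intact, which is exactly the statistical separability described above.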
15.5 Exponential Gains from Depth
We have seen in Sec. 6.4.1 that multilayer perceptrons are universal approximators, and that some functions can be represented by exponentially smaller deep networks compared to shallow networks. This decrease in model size leads to improved statistical efficiency. In this section, we describe how similar results apply more generally to other kinds of models with distributed hidden representations.

In Sec. 15.4, we saw an example of a generative model that learned about the explanatory factors underlying images of faces, including the person's gender and whether they are wearing glasses. The generative model that accomplished this task was based on a deep neural network. It would not be reasonable to expect a shallow network, such as a linear network, to learn the complicated relationship between these abstract explanatory factors and the pixels in the image.
In this and other AI tasks, the factors that can be chosen almost independently in order to generate data are more likely to be very high-level and related in highly nonlinear ways to the input. We argue that this demands deep distributed representations, where the higher level features (seen as functions of the input) or factors (seen as generative causes) are obtained through the composition of many nonlinearities.

It has been proven in many different settings that organizing computation through the composition of many nonlinearities and a hierarchy of reused features can give an exponential boost to statistical efficiency, on top of the exponential boost given by using a distributed representation. Many kinds of networks (e.g., with saturating nonlinearities, Boolean gates, sum/products, or RBF units) with a single hidden layer can be shown to be universal approximators.
A model family that is a universal approximator can approximate a large class of functions (including all continuous functions) up to any non-zero tolerance level, given enough hidden units. However, the required number of hidden units may be very large. Theoretical results concerning the expressive power of deep architectures state that there are families of functions that can be represented efficiently by an architecture of depth k, but would require an exponential number of hidden units (with respect to the input size) with insufficient depth (depth 2 or depth k − 1).

In Sec. 6.4.1, we saw that deterministic feedforward networks are universal approximators of functions. Many structured probabilistic models with a single hidden layer of latent variables, including restricted Boltzmann machines and deep belief networks, are
universal approximators of probability distributions (Le Roux and Bengio, 2008, 2010; Montúfar and Ay, 2011; Montúfar, 2014; Krause et al., 2013).

In Sec. 6.4.1, we saw that a sufficiently deep feedforward network can have an
exponential advantage over a network that is too shallow. Such results can also be obtained for other models such as probabilistic models. One such probabilistic model is the sum-product network or SPN (Poon and Domingos, 2011). These models use polynomial circuits to compute the probability distribution over a set of random variables. Delalleau and Bengio (2011) showed that there exist probability distributions for which a minimum depth of SPN is required to avoid needing an exponentially large model. Later, Martens and Medabalimi (2014) showed that there are significant differences between every two finite depths of SPN, and that some of the constraints used to make SPNs tractable may limit their representational power.

Another interesting development is a set of theoretical results for the expressive
power of families of deep circuits related to convolutional nets, highlighting an exponential advantage for the deep circuit even when the shallow circuit is allowed to only approximate the function computed by the deep circuit (Cohen et al., 2015). By comparison, previous theoretical work made claims regarding only the case where the shallow circuit must exactly replicate particular functions.
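A small numerical illustration of such exponential gains (our sketch, not one of the cited constructions): composing the piecewise-linear "tent" map g(x) = 1 − |2x − 1| with itself k times yields a function with 2^k linear pieces, so the number of regions distinguished grows exponentially with depth while the parameter count grows only linearly.

```python
import numpy as np

# Our sketch, not one of the cited constructions: each "layer" applies the
# tent map g(x) = 1 - |2x - 1|, which folds [0, 1] onto itself. k composed
# layers give a piecewise-linear function with 2^k pieces from O(k) parameters;
# a single-hidden-layer network would need on the order of 2^k units.
def tent(x):
    return 1.0 - np.abs(2.0 * x - 1.0)

def count_linear_pieces(k):
    # Use a dyadic grid so every kink of the composed map lands on a grid point.
    n = 2 ** (k + 3)
    x = np.linspace(0.0, 1.0, n + 1)
    y = x
    for _ in range(k):
        y = tent(y)
    slopes = np.round(np.diff(y) * n)   # exact slopes (+/- 2^k) on this grid
    return 1 + int(np.sum(slopes[1:] != slopes[:-1]))

for k in range(1, 6):
    print(k, count_linear_pieces(k))    # prints 2, 4, 8, 16, 32 pieces
```

Each fold reuses the pieces created by the previous layers, which is the hierarchy-of-reused-features effect described above.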
15.6 Providing Clues to Discover Underlying Causes
To close this chapter, we come back to one of our original questions: what makes one representation better than another? One answer, first introduced in Sec. 15.3, is that an ideal representation is one that disentangles the underlying causal factors of variation that generated the data, especially those factors that are relevant to our applications. Most strategies for representation learning are based on introducing clues that help the learner find these underlying factors of variation. The clues can help the learner separate these observed factors from the others. Supervised learning provides a very strong clue: a label y, presented with each x, that usually specifies the value of at least one of the factors of variation directly. More generally, to make use of abundant unlabeled data, representation learning makes use of other, less direct, hints about the underlying factors.
These hints take the form of implicit prior beliefs that we, the designers of the learning algorithm, impose in order to guide the learner. Results such as the no free lunch theorem show that regularization strategies are necessary to obtain good generalization. While it is impossible to find a universally superior regularization strategy, one goal of deep learning is to find a set of fairly generic regularization strategies that are applicable to a wide variety of AI tasks, similar to the tasks that people and animals are able to solve.

We provide here a list of these generic regularization strategies. The list is
clearly not exhaustive, but gives some concrete examples of ways that learning algorithms can be encouraged to discover features that correspond to underlying factors. This list was introduced in Sec. 3.1 of Bengio et al. (2013d) and has been partially expanded here.

• Smoothness: This is the assumption that f(x + εd) ≈ f(x) for unit d
and small ε. This assumption allows the learner to generalize from training examples to nearby points in input space. Many machine learning algorithms leverage this idea, but it is insufficient to overcome the curse of dimensionality.

• Linearity: Many learning algorithms assume that relationships between some variables are linear. This allows the algorithm to make predictions even very far from the observed data, but can sometimes lead to overly extreme predictions. Most simple machine learning algorithms that do not make the smoothness assumption instead make the linearity assumption. These are in fact different assumptions: linear functions with large weights applied to high-dimensional spaces may not be very smooth. See Goodfellow et al. (2014b) for a further discussion of the limitations of the linearity assumption.

• Multiple explanatory factors: Many representation learning
algorithms are motivated by the assumption that the data is generated by multiple underlying explanatory factors, and that most tasks can be solved easily given the state of each of these factors. Sec. 15.3 describes how this view motivates semi-supervised learning via representation learning. Learning the structure of p(x) requires learning some of the same features that are useful for modeling p(y | x) because both refer to the same underlying explanatory factors. Sec. 15.4 describes how this view motivates the use of distributed representations, with separate directions in representation space corresponding to separate factors of variation.

• Causal factors: the model is constructed in such a way that it treats the factors of variation described by the learned representation h as the causes of the observed data x, and not vice-versa. As discussed in Sec. 15.3, this
15.3, this factors of variation edervised by the learning learned represen tation as the causes • is adv for describ semi-sup and makes thehlearned mo advantageous antageous semi-supervised model del of the observ ed data , and not vice-v ersa. As discussed in Sec. 15.3 , this x more robust when the distribution ov over er the underlying causes changes or is adv antageous for semi-sup ervised learning and makes the learned mo del when we use the mo model del for a new task. more robust when the distribution over the underlying causes changes or • when Depth we usea the mo del for aorganization new task. , or hierarchical of explanatory factors factors:: Highlev level, el, abstract concepts can b e defined in terms of simple concepts, forming a Depth hierarchical organization of ofexplanatory factors : Highhierarc hierarch h,yor . Farom another p oint of view, the use a deep arc architecture hitecture expresses level,b elief abstract b e defined in terms of simple a • our thatconcepts the taskcan should b e accomplished via a mconcepts, ulti-step forming program, hierarchy. From another p oint of view, the use of a deep architecture expresses our b elief that the task should b558 e accomplished via a multi-step program,
with each step referring back to the output of the processing accomplished via previous steps.

• Shared factors across tasks: In the context where we have many tasks, corresponding to different y_i variables sharing the same input x, or where each task is associated with a subset or a function f^(i)(x) of a global input x, the assumption is that each y_i is associated with a different subset from a common pool of relevant factors h. Because these subsets overlap, learning all the P(y_i | x) via a shared intermediate representation P(h | x) allows sharing of statistical strength between the tasks.

• Manifolds: Probability mass concentrates, and the regions in which it concentrates are locally connected and occupy a tiny volume. In the continuous
case, these regions can be approximated by low-dimensional manifolds with a much smaller dimensionality than the original space where the data lives. Many machine learning algorithms behave sensibly only on this manifold (Goodfellow et al., 2014b). Some machine learning algorithms, especially autoencoders, attempt to explicitly learn the structure of the manifold.

• Natural clustering: Many machine learning algorithms assume that each connected manifold in the input space may be assigned to a single class. The data may lie on many disconnected manifolds, but the class remains constant within each one of these. This assumption motivates a variety of learning algorithms, including tangent propagation, double backprop, the manifold tangent classifier and adversarial training.
• Temporal and spatial coherence: Slow feature analysis and related algorithms make the assumption that the most important explanatory factors change slowly over time, or at least that it is easier to predict the true underlying explanatory factors than to predict raw observations such as pixel values. See Sec. 13.3 for further description of this approach.

• Sparsity: Most features should presumably not be relevant to describing most inputs: there is no need to use a feature that detects elephant trunks when representing an image of a cat. It is therefore reasonable to impose a prior that any feature that can be interpreted as "present" or "absent" should be absent most of the time.

• Simplicity of Factor Dependencies: In good high-level representations, the factors are related to each other through simple dependencies.
The simplest possible is marginal independence, P(h) = ∏_i P(h_i), but linear
dependencies or those captured by a shallow autoencoder are also reasonable assumptions. This can be seen in many laws of physics, and is assumed when plugging a linear predictor or a factorized prior on top of a learned representation.

The concept of representation learning ties together all of the many forms of deep learning. Feedforward and recurrent networks, autoencoders and deep probabilistic models all learn and exploit representations. Learning the best possible representation remains an exciting avenue of research.
Chapter 16

Structured Probabilistic Models for Deep Learning

Deep learning draws upon many modeling formalisms that researchers can use to guide their design efforts and describe their algorithms. One of these formalisms is the idea of structured probabilistic models. We have already discussed structured probabilistic models briefly in Sec. 3.14. That brief presentation was sufficient to understand how to use structured probabilistic models as a language to describe some of the algorithms in Part II. Now, in Part III, structured probabilistic models are a key ingredient of many of the most important research topics in deep learning. In order to prepare to discuss these research ideas, this chapter describes structured probabilistic models in much greater detail. This chapter is intended to be self-contained; the reader does not need to review the earlier introduction before continuing with this chapter.

A structured probabilistic model is a way of describing a probability distribution, using a graph to describe which random variables in the probability distribution interact with each other directly. Here we use "graph" in the graph theory sense—a set of vertices connected to one another by a set of edges. Because the structure of the model is defined by a graph, these models are often also referred to as graphical models.

The graphical models research community is large and has developed many different models, training algorithms, and inference algorithms. In this chapter, we provide basic background on some of the most central ideas of graphical models, with an emphasis on the concepts that have proven most useful to the deep learning research community. If you already have a strong background in graphical models, you may wish to skip most of this chapter. However, even a graphical model expert may benefit from reading the final section of this chapter, Sec. 16.7, in which we highlight some of the unique ways that graphical models are used for deep learning algorithms. Deep learning practitioners tend to use very different model structures, learning algorithms and inference procedures than are commonly used by the rest of the graphical models research community. In this chapter, we identify these differences in preferences and explain the reasons for them.

In this chapter we first describe the challenges of building large-scale probabilistic models. Next, we describe how to use a graph to describe the structure of a probability distribution. While this approach allows us to overcome many challenges, it is not without its own complications. One of the major difficulties in graphical modeling is understanding which variables need to be able to interact directly, i.e., which graph structures are most suitable for a given problem. We outline two approaches to resolving this difficulty by learning about the dependencies in Sec. 16.5. Finally, we close with a discussion of the unique emphasis that deep learning practitioners place on specific approaches to graphical modeling in Sec. 16.7.
16.1 The Challenge of Unstructured Modeling
The goal of deep learning is to scale machine learning to the kinds of challenges needed to solve artificial intelligence. This means being able to understand high-dimensional data with rich structure. For example, we would like AI algorithms to be able to understand natural images,¹ audio waveforms representing speech, and documents containing multiple words and punctuation characters.

¹ A natural image is an image that might be captured by a camera in a reasonably ordinary environment, as opposed to a synthetically rendered image, a screenshot of a web page, etc.

Classification algorithms can take an input from such a rich high-dimensional distribution and summarize it with a categorical label—what object is in a photo, what word is spoken in a recording, what topic a document is about. The process of classification discards most of the information in the input and produces a single output (or a probability distribution over values of that single output). The classifier is also often able to ignore many parts of the input. For example, when recognizing an object in a photo, it is usually possible to ignore the background of the photo.

It is possible to ask probabilistic models to do many other tasks. These tasks are often more expensive than classification. Some of them require producing multiple output values. Most require a complete understanding of the entire structure of the input, with no option to ignore sections of it. These tasks include the following:

• Density estimation: given an input x, the machine learning system returns an estimate of the true density p(x) under the data generating distribution. This requires only a single output, but it does require a complete understanding of the entire input. If even one element of the vector is unusual, the system must assign it a low probability.

• Denoising: given a damaged or incorrectly observed input x̃, the machine learning system returns an estimate of the original or correct x. For example, the machine learning system might be asked to remove dust or scratches from an old photograph. This requires multiple outputs (every element of the estimated clean example x) and an understanding of the entire input (since even one damaged area will still reveal the final estimate as being damaged).

• Missing value imputation: given the observations of some elements of x, the model is asked to return estimates of or a probability distribution over some or all of the unobserved elements of x. This requires multiple outputs. Because the model could be asked to restore any of the elements of x, it must understand the entire input.

• Sampling: the model generates new samples from the distribution p(x). Applications include speech synthesis, i.e. producing new waveforms that sound like natural human speech. This requires multiple output values and a good model of the entire input. If the samples have even one element drawn from the wrong distribution, then the sampling process is wrong.

For an example of a sampling task using small natural images, see Fig. 16.1.

Modeling a rich distribution over thousands or millions of random variables is a challenging task, both computationally and statistically. Suppose we only wanted to model binary variables. This is the simplest possible case, and yet already it seems overwhelming. For a small, 32 × 32 pixel color (RGB) image, there are 2^3072 possible binary images of this form. This number is over 10^800 times larger than the estimated number of atoms in the universe.

In general, if we wish to model a distribution over a random vector x containing n discrete variables capable of taking on k values each, then the naive approach of representing P(x) by storing a lookup table with one probability value per possible outcome requires k^n parameters!

This is not feasible for several reasons:
Figure 16.1: Probabilistic modeling of natural images. (Top) Example 32 × 32 pixel color images from the CIFAR-10 dataset (Krizhevsky and Hinton, 2009). (Bottom) Samples drawn from a structured probabilistic model trained on this dataset. Each sample appears at the same position in the grid as the training example that is closest to it in Euclidean space. This comparison allows us to see that the model is truly synthesizing new images, rather than memorizing the training data. Contrast of both sets of images has been adjusted for display. Figure reproduced with permission from Courville et al. (2011).
• Memory: the cost of storing the representation: For all but very small values of n and k, representing the distribution as a table will require too many values to store.

• Statistical efficiency: As the number of parameters in a model increases, so does the amount of training data needed to choose the values of those parameters using a statistical estimator. Because the table-based model has an astronomical number of parameters, it will require an astronomically large training set to fit accurately. Any such model will overfit the training set very badly unless additional assumptions are made linking the different entries in the table (for example, like in back-off or smoothed n-gram models, Sec. 12.4.1).

• Runtime: the cost of inference: Suppose we want to perform an inference task where we use our model of the joint distribution P(x) to compute some other distribution, such as the marginal distribution P(x_1) or the conditional distribution P(x_2 | x_1). Computing these distributions will require summing across the entire table, so the runtime of these operations is as high as the intractable memory cost of storing the model.

• Runtime: the cost of sampling: Likewise, suppose we want to draw a sample from the model. The naive way to do this is to sample some value u ~ U(0, 1), then iterate through the table adding up the probability values until they exceed u and return the outcome whose probability value was added last. This requires reading through the whole table in the worst case, so it has the same exponential cost as the other operations.
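The naive sampling procedure in the last bullet is easy to sketch directly. The tiny table below is invented for illustration; a real joint table over n binary variables would have 2^n entries, which is exactly the problem — the first two lines check the text's 32 × 32 image arithmetic with exact integers.

```python
import random

# Even the simplest case is overwhelming: a 32 x 32 binary RGB image has
# 2**3072 possible configurations, more than 10**800 times the roughly
# 10**80 atoms estimated to be in the observable universe.
n_configurations = 2 ** (32 * 32 * 3)
assert n_configurations > 10 ** 800 * 10 ** 80

# A tiny explicit lookup table: P(x) for every outcome of two binary variables.
table = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}

def sample_from_table(table, rng=random):
    """Naive table-based sampling: draw u ~ U(0, 1), then scan the table
    accumulating probability mass until the running sum exceeds u.
    Worst case reads the whole table, so the cost is exponential in the
    number of variables."""
    u = rng.random()
    cumulative = 0.0
    for outcome, p in table.items():
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome  # guard against floating-point round-off

print(sample_from_table(table))
```

With many draws, the empirical frequency of each outcome approaches its table entry, but each draw may touch every entry of the table — the exponential cost described above.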
The problem with the table-based approach is that we are explicitly modeling every possible kind of interaction between every possible subset of variables. The probability distributions we encounter in real tasks are much simpler than this. Usually, most variables influence each other only indirectly.

For example, consider modeling the finishing times of a team in a relay race. Suppose the team consists of three runners: Alice, Bob and Carol. At the start of the race, Alice carries a baton and begins running around a track. After completing her lap around the track, she hands the baton to Bob. Bob then runs his own lap and hands the baton to Carol, who runs the final lap. We can model each of their finishing times as a continuous random variable. Alice's finishing time does not depend on anyone else's, since she goes first. Bob's finishing time depends on Alice's, because Bob does not have the opportunity to start his lap until Alice has completed hers. If Alice finishes faster, Bob will finish faster, all else being equal. Finally, Carol's finishing time depends on both her teammates. If Alice is slow, Bob will probably finish late too. As a consequence, Carol will have quite a late starting time and thus is likely to have a late finishing time as well. However, Carol's finishing time depends only indirectly on Alice's finishing time via Bob's. If we already know Bob's finishing time, we will not be able to estimate Carol's finishing time better by finding out what Alice's finishing time was. This means we can model the relay race using only two interactions: Alice's effect on Bob and Bob's effect on Carol. We can omit the third, indirect interaction between Alice and Carol from our model.

Structured probabilistic models provide a formal framework for modeling only direct interactions between random variables. This allows the models to have significantly fewer parameters which can in turn be estimated reliably from less data. These smaller models also have dramatically reduced computational cost in terms of storing the model, performing inference in the model, and drawing samples from the model.
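The claim that Carol's time depends on Alice's only through Bob's can be checked numerically. The toy conditional tables below are invented for illustration; only the chain structure p(t0) p(t1 | t0) p(t2 | t1) comes from the example.

```python
import itertools

# Toy discretization: each finishing time is "fast" (0) or "slow" (1).
p_t0 = {0: 0.6, 1: 0.4}                 # Alice
p_t1_given_t0 = {0: {0: 0.7, 1: 0.3},   # Bob, conditioned on Alice
                 1: {0: 0.2, 1: 0.8}}
p_t2_given_t1 = {0: {0: 0.9, 1: 0.1},   # Carol, conditioned on Bob
                 1: {0: 0.3, 1: 0.7}}

# Joint distribution built from the chain factorization p(t0) p(t1|t0) p(t2|t1).
joint = {(a, b, c): p_t0[a] * p_t1_given_t0[a][b] * p_t2_given_t1[b][c]
         for a, b, c in itertools.product((0, 1), repeat=3)}

def p_t2_given(t1, t0):
    """p(t2 | t1, t0), computed from the joint by conditioning."""
    norm = sum(joint[(t0, t1, c)] for c in (0, 1))
    return {c: joint[(t0, t1, c)] / norm for c in (0, 1)}

# Once Bob's time is known, Alice's time carries no extra information
# about Carol's: this distribution does not change if we set t0=1.
print(p_t2_given(t1=0, t0=0))
```

However we pick the numbers in the tables, conditioning on t1 makes t0 irrelevant to t2, because t0 only enters the joint through factors that cancel in the conditional.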
16.2 Using Graphs to Describe Model Structure

Structured probabilistic models use graphs (in the graph theory sense of "nodes" or "vertices" connected by edges) to represent interactions between random variables. Each node represents a random variable. Each edge represents a direct interaction. These direct interactions imply other, indirect interactions, but only the direct interactions need to be explicitly modeled.

There is more than one way to describe the interactions in a probability distribution using a graph. In the following sections we describe some of the most popular and useful approaches. Graphical models can be largely divided into two categories: models based on directed acyclic graphs, and models based on undirected graphs.
16.2.1 Directed Models

One kind of structured probabilistic model is the directed graphical model, otherwise known as the belief network or Bayesian network² (Pearl, 1985).

² Judea Pearl suggested using the term "Bayesian network" when one wishes to "emphasize the judgmental" nature of the values computed by the network, i.e. to highlight that they usually represent degrees of belief rather than frequencies of events.

[Figure 16.2: a directed graph with three nodes, t0 (Alice), t1 (Bob) and t2 (Carol), and directed edges t0 → t1 → t2.]

Figure 16.2: A directed graphical model depicting the relay race example. Alice's finishing time t0 influences Bob's finishing time t1, because Bob does not get to start running until Alice finishes. Likewise, Carol only gets to start running after Bob finishes, so Bob's finishing time t1 directly influences Carol's finishing time t2.

Directed graphical models are called "directed" because their edges are directed,
that is, they point from one vertex to another. This direction is represented in the dra drawing wing with an arro arrow. w. The direction of the arrow indicates which variable’s that is, they point fromisone vertex another. direction is represented in probabilit probability y distribution defined in to terms of the This other’s. Drawing an arrow from wing with w. The direction y of distribution the arrow indicates which variable’s athe todra b means thatan wearro define the probabilit probability ov over er b via a conditional probability distribution is defined terms ofonthe Drawing arrow from distribution, with a as one of the in variables theother’s. right side of theanconditioning a to b means that we define the probabilit y distribution ov er b via a conditional bar. In other words, the distribution ov over er b dep depends ends on the value of a. distribution, with a as one of the variables on the right side of the conditioning Contin tinuing uingwords, with the race example from supp suppose ose we Alice’s bar.Con In tin other therelay distribution over b depSec. ends16.1 on ,the value of name a. finishing time t0 , Bob’s finishing time t1 , and Carol’s finishing time t 2. As we saw Conour tinuing with the race example from Sec. 16.1 ose wedirectly name Alice’s earlier, estimate of t1relay dep depends ends on t0. Our estimate of ,tsupp depends ends on t 1 2 dep finishing time t , Bob’s finishing time t , and Carol’s finishing time t . As we saw but only indirectly on t0 . We can dra draw w this relationship in a directed graphical earlier, our estimate t dep on t . Our estimate of t depends directly on t mo model, del, illustrated in of Fig. 16.2ends . but only indirectly on t . We can draw this relationship in a directed graphical Formally ormally,, a directed graphical mo model del defined on variables x is defined by a model, illustrated in Fig. 16.2. 
directed acyclic graph G whose vertices are the random variables in the mo model, del, and x F ormally , a directed graphical mo del defined on v ariables is defined (x ia) a set of lo loccal conditional pr prob ob obability ability distributions p(xi | P aG (xi)) where P a Gby directed vertices are thedistribution random variables moby del, and giv gives es theacyclic paren parents tsgraph of xi inwhose G . The probability ov over er xin is the given a set of local conditional G probability distributions p(x P a (x )) where P a (x ) gives the parents of x in . pThe probability (x) = Πi p(x i | Pdistribution aG (xi)) )).. | over x is given by(16.1) G p(x) = Π p(x P a (x )). (16.1) In our rela relay y race example, this means that, using the graph drawn in Fig. 16.2, | In our relay race example, this means that, drawn in Fig.(16.2) 16.2, | t1graph p(t0, t 1 , t2 ) = p(t0 )p(t 1 | tusing ). 0 )p(t2the p(t , t , t ) = p(t )p(t t )p(t t ). (16.2) This is our first time seeing a structured probabilistic mo model del in action. We | | w structured mo can examine the cost of using it, in order to observ observe e ho how modeling deling has This is our first time seeing a structured probabilistic mo del in action. We man many y adv advan an antages tages relative to unstructured mo modeling. deling. can examine the cost of using it, in order to observe how structured modeling has Suppose we represented by discretizing time ranging from min minute ute 0 to manSupp y advose antages relative to time unstructured modeling. min minute ute 10 into 6 second ch chunks. unks. This would mak makee t0 , t 1 and t 2 eac each h be discrete Suppose represented time bIfy w discretizing rangingt pfrom 0 to (t0, t 1min variables withwe100 possible values. e attemptedtime to represen represent , t 2 ute ) with a min ute 10 into 6 second ch unks. 
This would mak e t , t and t eac h b e discrete table, it would need to store 999,999 values (100 values of t 0 × 100 values of t 1 × variables possible values. If we attempted t p(t , t , t )iswith 100 valueswith of t2100 , minus 1, since the probabilit probability y of onetoofrepresen the configurations madea table, it would need to store 999,999 values (100 values of t 100 values of t 567 100 values of t , minus 1, since the probability of one of the configurations is made × ×
CHAPTER 16. STRUCTURED PROBABILISTIC MODELS FOR DEEP LEARNING
redundant by the constraint that the sum of the probabilities be 1). If instead we only make a table for each of the conditional probability distributions, then the distribution over t_0 requires 99 values, the table defining t_1 given t_0 requires 9,900 values, and so does the table defining t_2 given t_1. This comes to a total of 19,899 values. This means that using the directed graphical model reduced our number of parameters by a factor of more than 50!

In general, to model n discrete variables each having k values, the cost of the single-table approach scales like O(k^n), as we have observed before. Now suppose we build a directed graphical model over these variables. If m is the maximum number of variables appearing (on either side of the conditioning bar) in a single conditional probability distribution, then the cost of the tables for the directed model scales like O(k^m). As long as we can design a model such that m ≪ n, we get very dramatic savings.
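The parameter counts above can be checked with a short calculation. This is a sketch; the function names are ours, not the book's.

```python
def full_table_params(n, k):
    # A single joint table over n variables with k values each needs
    # k**n entries, minus 1 for the sum-to-one constraint.
    return k ** n - 1

def directed_model_params(parent_counts, k):
    # Each variable with m parents needs a conditional table with k**m
    # columns, each column holding k - 1 free probabilities.
    return sum((k - 1) * k ** m for m in parent_counts)

# Relay race: t_0 has no parents, t_1 and t_2 have one parent each; k = 100.
print(full_table_params(3, 100))              # 999999
print(directed_model_params([0, 1, 1], 100))  # 19899
```

The ratio 999,999 / 19,899 is just over 50, matching the factor quoted in the text.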
In other words, so long as each variable has few parents in the graph, the distribution can be represented with very few parameters. Some restrictions on the graph structure, such as requiring it to be a tree, can also guarantee that operations like computing marginal or conditional distributions over subsets of the variables are efficient.

It is important to realize what kinds of information can and cannot be encoded in the graph. The graph encodes only simplifying assumptions about which variables are conditionally independent from each other. It is also possible to make other kinds of simplifying assumptions. For example, suppose we assume Bob always runs the same regardless of how Alice performed.
(In reality, Alice's performance probably influences Bob's performance; depending on Bob's personality, if Alice runs especially fast in a given race, this might encourage Bob to push hard and match her exceptional performance, or it might make him overconfident and lazy.) Then the only effect Alice has on Bob's finishing time is that we must add Alice's finishing time to the total amount of time we think Bob needs to run. This observation allows us to define a model with O(k) parameters instead of O(k²). However, note that t_0 and t_1 are still directly dependent with this assumption, because t_1 represents the absolute time at which Bob finishes, not the total time he himself spends running. This means our graph must still contain an arrow from t_0 to t_1. The assumption that Bob's personal running time is independent from all other factors cannot be encoded in a graph over t_0, t_1, and t_2.
Instead, we encode this information in the definition of the conditional distribution itself. The conditional distribution is no longer a k × (k − 1)-element table indexed by t_0 and t_1 but is now a slightly more complicated formula using only k − 1 parameters. The directed graphical model syntax does not place any constraint on how we define
our conditional distributions. It only defines which variables they are allowed to take in as arguments.
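Because the directed factorization orders each variable after its parents, drawing a joint sample is straightforward: sample each variable from its conditional given already-sampled values. The sketch below illustrates this for the relay-race model; the Gaussian conditionals and their parameters are invented for illustration (the text uses discrete tables).

```python
import random

def sample_race(rng=random):
    # Ancestral sampling from p(t0, t1, t2) = p(t0) p(t1 | t0) p(t2 | t1):
    # each variable is drawn given its already-sampled parents.
    t0 = rng.gauss(5.0, 1.0)       # p(t0): Alice's finishing time (minutes)
    t1 = t0 + rng.gauss(4.0, 1.0)  # p(t1 | t0): Bob finishes after Alice
    t2 = t1 + rng.gauss(4.5, 1.0)  # p(t2 | t1): Carol finishes after Bob
    return t0, t1, t2

random.seed(0)
t0, t1, t2 = sample_race()
```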
16.2.2 Undirected Models
Directed graphical models give us one language for describing structured probabilistic models. Another popular language is that of undirected models, otherwise known as Markov random fields (MRFs) or Markov networks (Kindermann, 1980). As their name implies, undirected models use graphs whose edges are undirected.

Directed models are most naturally applicable to situations where there is a clear reason to draw each arrow in one particular direction. Often these are situations where we understand the causality and the causality only flows in one direction. One such situation is the relay race example. Earlier runners affect the finishing times of later runners; later runners do not affect the finishing times of earlier runners.

Not all situations we might want to model have such a clear direction to their interactions. When the interactions seem to have no intrinsic direction, or to operate in both directions, it may be more appropriate to use an undirected model.
As an example of such a situation, suppose we want to model a distribution over three binary variables: whether or not you are sick, whether or not your coworker is sick, and whether or not your roommate is sick. As in the relay race example, we can make simplifying assumptions about the kinds of interactions that take place. Assuming that your coworker and your roommate do not know each other, it is very unlikely that one of them will give the other a disease such as a cold directly. This event can be seen as so rare that it is acceptable not to model it. However, it is reasonably likely that either of them could give you a cold, and that you could pass it on to the other.
We can model the indirect transmission of a cold from your coworker to your roommate by modeling the transmission of the cold from your coworker to you and the transmission of the cold from you to your roommate.

In this case, it is just as easy for you to cause your roommate to get sick as it is for your roommate to make you sick, so there is not a clean, uni-directional narrative on which to base the model. This motivates using an undirected model. As with directed models, if two nodes in an undirected model are connected by an edge, then the random variables corresponding to those nodes interact with each other directly. Unlike directed models, the edge in an undirected model has no arrow, and is not associated with a conditional probability distribution.

We denote the random variable representing your health as h_y, the random
Figure 16.3: An undirected graph representing how your roommate's health h_r, your health h_y, and your work colleague's health h_c affect each other. You and your roommate might infect each other with a cold, and you and your work colleague might do the same, but assuming that your roommate and your colleague do not know each other, they can only infect each other indirectly via you.
variable representing your roommate's health as h_r, and the random variable representing your colleague's health as h_c. See Fig. 16.3 for a drawing of the graph representing this scenario.

Formally, an undirected graphical model is a structured probabilistic model defined on an undirected graph G. For each clique C in the graph,³ a factor φ(C) (also called a clique potential) measures the affinity of the variables in that clique for being in each of their possible joint states. The factors are constrained to be non-negative. Together they define an unnormalized probability distribution

p̃(x) = Π_{C∈G} φ(C).  (16.3)

The unnormalized probability distribution is efficient to work with so long as all the cliques are small. It encodes the idea that states with higher affinity are more likely. However, unlike in a Bayesian network, there is little structure to the definition of the cliques, so there is nothing to guarantee that multiplying them
together will yield a valid probability distribution. See Fig. 16.4 for an example of reading factorization information from an undirected graph.

Our example of the cold spreading between you, your roommate, and your colleague contains two cliques. One clique contains h_y and h_c. The factor for this clique can be defined by a table, and might have values resembling these:

            h_y = 0   h_y = 1
  h_c = 0      2         1
  h_c = 1      1        10

A state of 1 indicates good health, while a state of 0 indicates poor health (having been infected with a cold). Both of you are usually healthy, so the

³A clique of the graph is a subset of nodes that are all connected to each other by an edge of the graph.
corresponding state has the highest affinity. The state where only one of you is sick has the lowest affinity, because this is a rare state. The state where both of you are sick (because one of you has infected the other) is a higher affinity state, though still not as common as the state where both are healthy.

To complete the model, we would need to also define a similar factor for the clique containing h_y and h_r.
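The full model can be enumerated by brute force. In this sketch we reuse the (h_y, h_c) table from the text and, purely as an assumption for illustration, give the (h_y, h_r) clique the same values, since the text only says it would be "similar".

```python
# Factor table for the (h_y, h_c) clique, copied from the text.
# Assumption: the (h_y, h_r) clique reuses the same values.
phi_yc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0}
phi_yr = dict(phi_yc)

def p_tilde(h_y, h_r, h_c):
    # Unnormalized probability: the product of the clique factors (Eq. 16.3).
    return phi_yc[(h_y, h_c)] * phi_yr[(h_y, h_r)]

# Summing over all 2**3 joint states normalizes the distribution.
states = [(y, r, c) for y in (0, 1) for r in (0, 1) for c in (0, 1)]
Z = sum(p_tilde(*s) for s in states)

# Everyone healthy is the most likely joint state:
print(max(states, key=lambda s: p_tilde(*s)))  # (1, 1, 1)
```

Note that the factors alone sum to 130, not 1; dividing by that sum is exactly the normalization discussed in the next section.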
16.2.3 The Partition Function
While the unnormalized probability distribution is guaranteed to be non-negative everywhere, it is not guaranteed to sum or integrate to 1. To obtain a valid probability distribution, we must use the corresponding normalized probability distribution:⁴

p(x) = (1/Z) p̃(x)  (16.4)

where Z is the value that results in the probability distribution summing or integrating to 1:

Z = ∫ p̃(x) dx.  (16.5)
You can think of Z as a constant when the φ functions are held constant. Note that if the φ functions have parameters, then Z is a function of those parameters. It is common in the literature to write Z with its arguments omitted to save space. The normalizing constant Z is known as the partition function, a term borrowed from statistical physics.

Since Z is an integral or sum over all possible joint assignments of the state x, it is often intractable to compute. In order to be able to obtain the normalized probability distribution of an undirected model, the model structure and the definitions of the φ functions must be conducive to computing Z efficiently. In the context of deep learning, Z is usually intractable. Due to the intractability of computing Z exactly, we must resort to approximations. Such approximate algorithms are the topic of Chapter 18.

One important consideration to keep in mind when designing undirected models
is that it is possible to specify the factors in such a way that Z does not exist. This happens if some of the variables in the model are continuous and the integral of p̃ over their domain diverges. For example, suppose we want to model a single

⁴A distribution defined by normalizing a product of clique potentials is also called a Gibbs distribution.
scalar variable x ∈ ℝ with a single clique potential φ(x) = x². In this case,

Z = ∫ x² dx.  (16.6)

Since this integral diverges, there is no probability distribution corresponding to this choice of φ(x). Sometimes the choice of some parameter of the φ functions determines whether the probability distribution is defined. For example, for φ(x; β) = exp(−βx²), the β parameter determines whether Z exists. Positive β results in a Gaussian distribution over x, but all other values of β make φ impossible to normalize.

One key difference between directed modeling and undirected modeling is that directed models are defined directly in terms of probability distributions from the start, while undirected models are defined more loosely by φ functions that are then converted into probability distributions.
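Returning to the φ(x; β) = exp(−βx²) example above: for positive β the partition function is Z = √(π/β), which can be checked with a rough numerical quadrature (the integration limits and grid size below are arbitrary choices of ours).

```python
import math

def Z_numeric(beta, lim=50.0, n=200_000):
    # Midpoint-rule estimate of Z = ∫ exp(-beta * x**2) dx over [-lim, lim].
    # For beta > 0 the tails beyond |x| = lim are negligible; for beta <= 0
    # the integrand does not decay, and the true integral diverges.
    dx = 2 * lim / n
    return sum(math.exp(-beta * (-lim + (i + 0.5) * dx) ** 2)
               for i in range(n)) * dx

beta = 2.0
print(Z_numeric(beta))            # close to sqrt(pi / beta)
print(math.sqrt(math.pi / beta))  # ≈ 1.2533
```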
This changes the intuitions one must develop in order to work with these models. One key idea to keep in mind while working with undirected models is that the domain of each of the variables has a dramatic effect on the kind of probability distribution that a given set of φ functions corresponds to. For example, consider an n-dimensional vector-valued random variable x and an undirected model parametrized by a vector of biases b. Suppose we have one clique for each element of x, φ^(i)(x_i) = exp(b_i x_i). What kind of probability distribution does this result in? The answer is that we do not have enough information, because we have not yet specified the domain of x. If x ∈ ℝⁿ, then the integral defining Z diverges and no probability distribution exists. If x ∈ {0, 1}ⁿ, then p(x) factorizes into n independent distributions, with p(x_i = 1) = sigmoid(b_i). If the domain of x is the set of elementary basis vectors {[1, 0, . . . , 0], [0, 1, . . . , 0], . . . , [0, 0, . . . , 1]}, then p(x) = softmax(b), so a large
value of b_i actually reduces p(x_j = 1) for j ≠ i. Often, it is possible to leverage the effect of a carefully chosen domain of a variable in order to obtain complicated behavior from a relatively simple set of φ functions. We will explore a practical application of this idea later, in Sec. 20.6.
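The dependence on the domain can be verified by brute-force enumeration for a small n. This is a sketch; the bias values are arbitrary.

```python
import math

b = [1.0, -0.5, 2.0]  # arbitrary bias vector
n = len(b)

def p_tilde(x):
    # Product of the per-element factors phi_i(x_i) = exp(b_i * x_i).
    return math.exp(sum(bi * xi for bi, xi in zip(b, x)))

# Domain {0, 1}^n: the model factorizes, with p(x_i = 1) = sigmoid(b_i).
binary_states = [[(s >> i) & 1 for i in range(n)] for s in range(2 ** n)]
Z_bin = sum(p_tilde(x) for x in binary_states)
p_x0 = sum(p_tilde(x) for x in binary_states if x[0] == 1) / Z_bin
print(p_x0, 1 / (1 + math.exp(-b[0])))  # both equal sigmoid(b[0])

# Domain restricted to one-hot vectors: the same factors give softmax(b).
one_hots = [[int(i == j) for i in range(n)] for j in range(n)]
Z_hot = sum(p_tilde(x) for x in one_hots)
p_hot = [p_tilde(x) / Z_hot for x in one_hots]
print(p_hot)  # equals softmax(b)
```

The same φ functions yield two very different distributions purely because the set of states summed over in Z changes.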
16.2.4 Energy-Based Models
Many interesting theoretical results about undirected models depend on the assumption that ∀x, p̃(x) > 0. A convenient way to enforce this condition is to use an energy-based model (EBM) where

p̃(x) = exp(−E(x))  (16.7)

and E(x) is known as the energy function. Because exp(z) is positive for all z, this guarantees that no energy function will result in a probability of zero for any state x.
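A minimal sketch of this guarantee, using an invented energy function over one binary variable:

```python
import math

def energy(x):
    # Hypothetical energy function: state 1 has higher energy (is less likely).
    return 3.0 * x

# Eq. 16.7: exp(-E(x)) is strictly positive for any finite energy,
# so no state is assigned probability zero.
p_tilde = {x: math.exp(-energy(x)) for x in (0, 1)}
Z = sum(p_tilde.values())
p = {x: v / Z for x, v in p_tilde.items()}

print(all(v > 0 for v in p.values()))  # True
print(p[0] > p[1])                     # True: lower energy, higher probability
```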
Figure 16.4: This graph implies that p(a, b, c, d, e, f) can be written as (1/Z) φ_{a,b}(a, b) φ_{b,c}(b, c) φ_{a,d}(a, d) φ_{b,e}(b, e) φ_{e,f}(e, f) for an appropriate choice of the φ functions.
Being completely free to choose the energy function makes learning simpler. If we learned the clique potentials directly, we would need to use constrained optimization to arbitrarily impose some specific minimal probability value. By learning the energy function, we can use unconstrained optimization.⁵ The probabilities in an energy-based model can approach arbitrarily close to zero but never reach it.

Any distribution of the form given by Eq. 16.7 is an example of a Boltzmann distribution. For this reason, many energy-based models are called Boltzmann machines (Fahlman et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986). There is no accepted guideline for when to call a model an energy-based model and when to call it a Boltzmann machine. The term Boltzmann
machine was first introduced to describe a model with exclusively binary variables, but today many models such as the mean-covariance restricted Boltzmann machine incorporate real-valued variables as well. While Boltzmann machines were originally defined to encompass both models with and without latent variables, the term Boltzmann machine is today most often used to designate models with latent variables, while Boltzmann machines without latent variables are more often called Markov random fields or log-linear models.

Cliques in an undirected graph correspond to factors of the unnormalized probability function. Because exp(a) exp(b) = exp(a + b), this means that different cliques in the undirected graph correspond to the different terms of the energy function.
In other words, an energy-based model is just a special kind of Markov network: the exponentiation makes each term in the energy function correspond to a factor for a different clique. See Fig. 16.5 for an example of how to read the form of the energy function from an undirected graph structure. One can view an energy-based model with multiple terms in its energy function as being a product of experts (Hinton, 1999). Each term in the energy function corresponds to another

⁵For some models, we may still need to use constrained optimization to make sure Z exists.
CHAPTER 16. STRUCTURED PROBABILISTIC MODELS FOR DEEP LEARNING
Figure 16.5: This graph implies that E(a, b, c, d, e, f) can be written as E_{a,b}(a, b) + E_{b,c}(b, c) + E_{a,d}(a, d) + E_{b,e}(b, e) + E_{e,f}(e, f) for an appropriate choice of the per-clique energy functions. Note that we can obtain the φ functions in Fig. 16.4 by setting each φ to the exponential of the corresponding negative energy, e.g., φ_{a,b}(a, b) = exp(−E(a, b)).

factor in the probability distribution. Each term of the energy function can be thought of as an "expert" that determines whether a particular soft constraint is satisfied. Each expert may enforce only one constraint that concerns only a low-dimensional projection of the random variables, but when combined by multiplication of probabilities, the experts together enforce a complicated high-dimensional constraint.

One part of the definition of an energy-based model serves no functional purpose from a machine learning point of view: the − sign in Eq. 16.7. This − sign could be incorporated into the definition of E, or for many functions E the learning algorithm could simply learn parameters with opposite sign. The − sign is present primarily to preserve compatibility between the machine learning literature and the physics literature. Many advances in probabilistic modeling
were originally developed by statistical physicists, for whom E refers to actual, physical energy and does not have arbitrary sign. Terminology such as "energy" and "partition function" remains associated with these techniques, even though their mathematical applicability is broader than the physics context in which they were developed. Some machine learning researchers (e.g., Smolensky (1986), who referred to negative energy as harmony) have chosen to omit the negation, but this is not the standard convention.

Many algorithms that operate on probabilistic models do not need to compute p_model(x) but only log p̃_model(x). For energy-based models with latent variables h, these algorithms are sometimes phrased in terms of the negative of this quantity, called the free energy:

    F(x) = −log Σ_h exp(−E(x, h)).        (16.8)

In this book, we usually prefer the more general log p̃_model(x) formulation.
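For small models, Eq. 16.8 can be evaluated directly by enumerating the latent states. The following is a hedged sketch: the `energy` function and the choice of two binary latent variables are hypothetical, made up only to illustrate the marginalization over h.

```python
import itertools
import math

# Free energy F(x) = -log sum_h exp(-E(x, h))  (Eq. 16.8), illustrated
# for a made-up model with two binary latent variables.
def energy(x, h):
    # x: observed value, h: tuple of binary latents (arbitrary toy energy)
    return -x * sum(h) + 0.5 * sum(h)

def free_energy(x):
    # Marginalize the latents out by brute-force enumeration.
    total = sum(math.exp(-energy(x, h))
                for h in itertools.product([0, 1], repeat=2))
    return -math.log(total)

# -F(x) equals log p_tilde(x) after h has been summed out.
print(free_energy(0.0), free_energy(1.0))
```

For larger latent spaces the sum is intractable and must be approximated, which is one motivation for the approximate inference methods of Chapter 19.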
Figure 16.6: (a) The path between random variable a and random variable b through s is active, because s is not observed. This means that a and b are not separated. (b) Here s is shaded in, to indicate that it is observed. Because the only path between a and b is through s, and that path is inactive, we can conclude that a and b are separated given s.
16.2.5 Separation and D-Separation
The edges in a graphical model tell us which variables directly interact. We often need to know which variables indirectly interact. Some of these indirect interactions can be enabled or disabled by observing other variables. More formally, we would like to know which subsets of variables are conditionally independent from each other, given the values of other subsets of variables.

Identifying the conditional independences in a graph is very simple in the case of undirected models. In this case, conditional independence implied by the graph is called separation. We say that a set of variables A is separated from another set of variables B given a third set of variables S if the graph structure implies that A is independent from B given S. If two variables a and b are connected by a path
involving only unobserved variables, then those variables are not separated. If no path exists between them, or all paths contain an observed variable, then they are separated. We refer to paths involving only unobserved variables as "active" and paths including an observed variable as "inactive."

When we draw a graph, we can indicate observed variables by shading them in. See Fig. 16.6 for a depiction of how active and inactive paths in an undirected model look when drawn in this way. See Fig. 16.7 for an example of reading separation from an undirected graph.

Similar concepts apply to directed models, except that in the context of directed models, these concepts are referred to as d-separation.
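The separation rule for undirected graphs reduces to a reachability check: delete the observed nodes and see whether any path remains. A minimal sketch, assuming the hypothetical a–s–b structure of Fig. 16.6 (the function name `separated` and the edge-list representation are our own choices):

```python
from collections import deque

def separated(edges, a, b, observed):
    # a and b are separated given `observed` iff every path between them
    # passes through an observed node, i.e. b is unreachable from a once
    # the observed nodes are removed from the graph.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    frontier, seen = deque([a]), {a}
    while frontier:
        node = frontier.popleft()
        if node == b:
            return False          # found an active (fully unobserved) path
        for nxt in adj.get(node, ()):
            if nxt not in seen and nxt not in observed:
                seen.add(nxt)
                frontier.append(nxt)
    return True

edges = [("a", "s"), ("s", "b")]
print(separated(edges, "a", "b", set()))   # False: the path a-s-b is active
print(separated(edges, "a", "b", {"s"}))   # True: observing s blocks it
```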
The "d" stands for "dependence." D-separation for directed graphs is defined the same as separation for undirected graphs: We say that a set of variables A is d-separated from another set of variables B given a third set of variables S if the graph structure implies that A is independent from B given S.

As with undirected models, we can examine the independences implied by the graph by looking at what active paths exist in the graph. As before, two variables are dependent if there is an active path between them, and d-separated if no such
Figure 16.7: An example of reading separation properties from an undirected graph. Here b is shaded to indicate that it is observed. Because observing b blocks the only path from a to c, we say that a and c are separated from each other given b. The observation of b also blocks one path between a and d, but there is a second, active path between them. Therefore, a and d are not separated given b.
path exists. In directed nets, determining whether a path is active is somewhat more complicated. See Fig. 16.8 for a guide to identifying active paths in a directed model. See Fig. 16.9 for an example of reading some properties from a graph.

It is important to remember that separation and d-separation tell us only about those conditional independences that are implied by the graph. There is no requirement that the graph imply all independences that are present. In particular, it is always legitimate to use the complete graph (the graph with all possible edges) to represent any distribution. In fact, some distributions contain independences that are not possible to represent with existing graphical notation. Context-specific independences are independences that are present dependent on the value of some variables in the network. For example, consider a model of three binary variables: a, b and c.
Suppose that when a is 0, b and c are independent, but when a is 1, b is deterministically equal to c. Encoding the behavior when a = 1 requires an edge connecting b and c. The graph then fails to indicate that b and c are independent when a = 0.

In general, a graph will never imply that an independence exists when it does not. However, a graph may fail to encode an independence.
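D-separation can be tested mechanically with a reachability procedure over active trails. The sketch below follows the well-known "Bayes ball"-style algorithm rather than code from the book; the representation of a DAG as a node → parents dict is our own choice. Note how it captures the V-structure rules of Fig. 16.8: a collider is active exactly when it (or a descendant) is observed.

```python
def d_separated(parents, a, b, observed):
    """parents: dict mapping each node to a set of its parents (a DAG).
    Returns True if a and b are d-separated given the set `observed`."""
    children = {}
    for node, ps in parents.items():
        for p in ps:
            children.setdefault(p, set()).add(node)

    # Ancestors of the observed set, needed for the V-structure rule:
    # a collider is active if it or any of its descendants is observed.
    anc, frontier = set(observed), list(observed)
    while frontier:
        node = frontier.pop()
        for p in parents.get(node, ()):
            if p not in anc:
                anc.add(p)
                frontier.append(p)

    # Depth-first search over (node, direction-of-arrival) states.
    visited, reachable = set(), set()
    frontier = [(a, "up")]
    while frontier:
        node, direction = frontier.pop()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in observed:
            reachable.add(node)
        if direction == "up" and node not in observed:
            # Trail may continue through both parents and children.
            frontier += [(p, "up") for p in parents.get(node, ())]
            frontier += [(c, "down") for c in children.get(node, ())]
        elif direction == "down":
            if node not in observed:
                frontier += [(c, "down") for c in children.get(node, ())]
            if node in anc:   # active collider: explaining away
                frontier += [(p, "up") for p in parents.get(node, ())]
    return b not in reachable

# Hypothetical V-structure, as in Fig. 16.8(c): a -> s <- b.
v_structure = {"s": {"a", "b"}}
print(d_separated(v_structure, "a", "b", set()))   # True: collider blocks the path
print(d_separated(v_structure, "a", "b", {"s"}))   # False: observing s activates it
```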
16.2.6 Converting between Undirected and Directed Graphs
We often refer to a specific machine learning model as being undirected or directed. For example, we typically refer to RBMs as undirected and sparse coding as directed. This choice of wording can be somewhat misleading, because no probabilistic model is inherently directed or undirected. Instead, some models are most easily described using a directed graph, or most easily described using an undirected graph.
Figure 16.8: All of the kinds of active paths of length two that can exist between random variables a and b. (a) Any path with arrows proceeding directly from a to b or vice versa. This kind of path becomes blocked if s is observed. We have already seen this kind of path in the relay race example. (b) a and b are connected by a common cause s. For example, suppose s is a variable indicating whether or not there is a hurricane and a and b measure the wind speed at two different nearby weather monitoring outposts. If we observe very high winds at station a, we might expect to also see high winds at b. This kind of path can be blocked by observing s. If we already know there is a hurricane, we expect to see high winds at b, regardless of what is observed at a. A lower than expected wind at a (for a hurricane) would not change our expectation of winds at b (knowing there is a hurricane). However, if s is not observed, then a and b are dependent, i.e., the path is active. (c) a and b are both parents of s. This is called a V-structure or the collider case. The V-structure causes a and b to be related by the explaining away effect. In this case, the path is actually active when s is observed. For example, suppose s is a variable indicating that your colleague is not at work. The variable a represents her being sick, while b represents her being on vacation. If you observe that she is not at work, you can presume she is probably sick or on vacation, but it is not especially likely that both have happened at the same time. If you find out that she is on vacation, this fact is sufficient to explain her absence. You can infer that she is probably not also sick. (d) The explaining away effect happens even if any descendant of s is observed! For example, suppose that c is a variable representing whether you have received a report from your colleague. If you notice that you have not received the report, this increases your estimate of the probability that she is not at work today, which in turn makes it more likely that she is either sick or on vacation. The only way to block a path through a V-structure is to observe none of the descendants of the shared child.
Figure 16.9: From this graph, we can read out several d-separation properties. Examples include:

• a and b are d-separated given the empty set.
• a and e are d-separated given c.
• d and e are d-separated given c.

We can also see that some variables are no longer d-separated when we observe some variables:

• a and b are not d-separated given c.
• a and b are not d-separated given d.
Figure 16.10: Examples of complete graphs, which can describe any probability distribution. Here we show examples with four random variables. (Left) The complete undirected graph. In the undirected case, the complete graph is unique. (Right) A complete directed graph. In the directed case, there is not a unique complete graph. We choose an ordering of the variables and draw an arc from each variable to every variable that comes after it in the ordering. There are thus a factorial number of complete graphs for every set of random variables. In this example we order the variables from left to right, top to bottom.
Directed models and undirected models both have their advantages and disadvantages. Neither approach is clearly superior and universally preferred. Instead, we should choose which language to use for each task. This choice will partially depend on which probability distribution we wish to describe. We may choose to use either directed modeling or undirected modeling based on which approach can capture the most independences in the probability distribution or which approach uses the fewest edges to describe the distribution. There are other factors that can affect the decision of which language to use. Even while working with a single probability distribution, we may sometimes switch between different modeling languages. Sometimes a different language becomes more appropriate if we observe a certain subset of variables, or if we wish to perform a different computational task.
For example, the directed model description often provides a straightforward approach to efficiently draw samples from the model (described in Sec. 16.3), while the undirected model formulation is often useful for deriving approximate inference procedures (as we will see in Chapter 19, where the role of undirected models is highlighted in Eq. 19.56).

Every probability distribution can be represented by either a directed model or by an undirected model. In the worst case, one can always represent any distribution by using a "complete graph." In the case of a directed model, the complete graph is any directed acyclic graph where we impose some ordering on the random variables, and each variable has all other variables that precede it in the ordering as its ancestors in the graph. For an undirected model, the complete graph is simply a graph containing a single clique encompassing all of the variables. See Fig. 16.10 for an example.
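The two complete-graph constructions just described are small enough to write down directly. This is an illustrative sketch; the variable names and the particular ordering are arbitrary choices.

```python
import itertools

# Complete graphs over four variables, as in Fig. 16.10.
variables = ["a", "b", "c", "d"]

# Undirected complete graph: one edge per unordered pair (unique).
undirected = {frozenset(p) for p in itertools.combinations(variables, 2)}

# Directed complete graph: choose an ordering, then draw an arc from each
# variable to every variable that comes after it in the ordering.
directed = [(u, v) for i, u in enumerate(variables)
            for v in variables[i + 1:]]

print(len(undirected), len(directed))  # both have C(4,2) = 6 edges
```

A different ordering of `variables` yields a different (but equally valid) complete directed graph, which is why there are factorially many of them.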
Of course, the utility of a graphical model is that the graph implies that some variables do not interact directly. The complete graph is not very useful because it does not imply any independences.

When we represent a probability distribution with a graph, we want to choose a graph that implies as many independences as possible, without implying any independences that do not actually exist.

From this point of view, some distributions can be represented more efficiently using directed models, while other distributions can be represented more efficiently using undirected models. In other words, directed models can encode some independences that undirected models cannot encode, and vice versa.

Directed models are able to use one specific kind of substructure that undirected models cannot represent perfectly. This substructure is called an immorality. The structure occurs when two random variables a and b are both parents of a third
random variable c, and there is no edge directly connecting a and b in either direction. (The name "immorality" may seem strange; it was coined in the graphical models literature as a joke about unmarried parents.) To convert a directed model with graph D into an undirected model, we need to create a new graph U. For every pair of variables x and y, we add an undirected edge connecting x and y to U if there is a directed edge (in either direction) connecting x and y in D or if x and y are both parents in D of a third variable z. The resulting U is known as a moralized graph. See Fig. 16.11 for examples of converting directed models to undirected models via moralization.

Likewise, undirected models can include substructures that no directed model can represent perfectly.
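The moralization procedure described above (drop edge directions, then "marry" the co-parents of every node) is short enough to sketch directly. The representation of the directed graph D as a child → parents mapping is an assumption of this illustration:

```python
import itertools

def moralize(parents):
    """Convert a directed graph (dict node -> set of parents) into the
    set of undirected edges of its moralized graph U."""
    edges = set()
    for node, ps in parents.items():
        # Keep every directed edge, dropping its orientation.
        for p in ps:
            edges.add(frozenset((p, node)))
        # "Marry" every pair of parents of a common child: this is the
        # extra edge an immorality forces us to add.
        for p, q in itertools.combinations(ps, 2):
            edges.add(frozenset((p, q)))
    return edges

# The immorality of Fig. 16.11 (center): a -> c <- b, no edge between a and b.
print(moralize({"c": {"a", "b"}}))
```

The output contains the marrying edge between a and b, so a, b, c form a single clique in U, and the independence a⊥b is no longer encoded.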
Specifically, a directed graph D cannot capture all of the conditional independences implied by an undirected graph U if U contains a loop of length greater than three, unless that loop also contains a chord. A loop is a sequence of variables connected by undirected edges, with the last variable in the sequence connected back to the first variable in the sequence. A chord is a connection between any two non-consecutive variables in the sequence defining a loop. If U has loops of length four or greater and does not have chords for these loops, we must add the chords before we can convert it to a directed model. Adding these chords discards some of the independence information that was encoded in U. The graph formed by adding chords to U is known as a chordal or triangulated graph, because all the loops can now be described in terms of smaller, triangular loops.
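The chord condition just defined can be checked mechanically. This is a hypothetical helper following the definitions above (a loop is a cyclic sequence of variables; a chord is an edge between two non-consecutive members of that sequence); the function name and graph representation are our own:

```python
import itertools

def has_chord(loop, edges):
    # loop: list of variables in cyclic order; edges: undirected edge list.
    edge_set = {frozenset(e) for e in edges}
    n = len(loop)
    for i, j in itertools.combinations(range(n), 2):
        # Consecutive positions (including the wrap-around pair) are the
        # loop's own edges, not chords.
        consecutive = (j - i == 1) or (i == 0 and j == n - 1)
        if not consecutive and frozenset((loop[i], loop[j])) in edge_set:
            return True
    return False

square = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
print(has_chord(["a", "b", "c", "d"], square))                 # False: length-4 loop, no chord
print(has_chord(["a", "b", "c", "d"], square + [("a", "c")]))  # True: a-c triangulates it
```

The chordless square is exactly the kind of undirected structure that no directed graph can represent without adding such a triangulating edge first.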
o buildformed a directed graph chords the chordal we need also assign graph, because alledges. the loops candoing now bso, e describ ed in of asmaller, triangular U Ue must directions to the When w notterms create directed cycle in lo ops. T o build a directed graph from the chordal graph, we need to also D, or the result do does es not define a valid directed probabilistic mo model. del. Oneassign wa way y directions to the edges. When doing we imp must a directed cycle in Din Dso, to assign directions to the edges is to impose osenot an create ordering on the random , or the result does not define a valid directed probabilistic model. One way to is to impose an ordering on the random D assign directions to the edges in 580 D
CHAPTER 16. STRUCTURED PROBABILISTIC MODELS FOR DEEP LEARNING
Figure 16.11: Examples of converting directed models (top row) to undirected models (bottom row) by constructing moralized graphs. (Left) This simple chain can be converted to a moralized graph merely by replacing its directed edges with undirected edges. The resulting undirected model implies exactly the same set of independences and conditional independences. (Center) This graph is the simplest directed model that cannot be converted to an undirected model without losing some independences. This graph consists entirely of a single immorality. Because a and b are parents of c, they are connected by an active path when c is observed. To capture this dependence, the undirected model must include a clique encompassing all three variables. This clique fails to encode the fact that a⊥b. (Right) In general, moralization may add many edges to the graph, thus losing many implied independences. For example, this sparse coding graph requires adding moralizing edges between every pair of hidden units, thus introducing a quadratic number of new direct dependences.
Figure 16.12: Converting an undirected model to a directed model. (Left) This undirected model cannot be converted to a directed model because it has a loop of length four with no chords. Specifically, the undirected model encodes two different independences that no directed model can capture simultaneously: a⊥c | {b, d} and b⊥d | {a, c}. (Center) To convert the undirected model to a directed model, we must triangulate the graph, by ensuring that all loops of greater than length three have a chord. To do so, we can either add an edge connecting a and c or we can add an edge connecting b and d. In this example, we choose to add the edge connecting a and c. (Right) To finish the conversion process, we must assign a direction to each edge. When doing so, we must not create any directed cycles. One way to avoid directed cycles is to impose an ordering over the nodes, and always point each edge from the node that comes earlier in the ordering to the node that comes later in the ordering. In this example, we use the variable names to impose alphabetical order.
variables, then point each edge from the node that comes earlier in the ordering to the node that comes later in the ordering. See Fig. 16.12 for a demonstration.
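The ordering trick can be sketched as follows. The code is illustrative, not from the book: given the edges of the triangulated graph from Fig. 16.12 and an ordering of the nodes, it points every edge from the earlier node to the later one, which rules out directed cycles by construction.

```python
def orient_edges(undirected_edges, ordering):
    """Point each undirected edge from the node that comes earlier in
    `ordering` to the node that comes later, yielding directed edges."""
    rank = {node: i for i, node in enumerate(ordering)}
    return {tuple(sorted(e, key=rank.get)) for e in undirected_edges}

# The triangulated square from Fig. 16.12, oriented alphabetically.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
print(orient_edges(edges, ordering="abcd"))
```

Because every directed edge goes from a lower-ranked node to a higher-ranked one, any directed path strictly increases rank and can never return to its start.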
16.2.7 Factor Graphs
Factor graphs are another way of drawing undirected models that resolve an ambiguity in the graphical representation of standard undirected model syntax. In an undirected model, the scope of every φ function must be a subset of some clique in the graph. However, it is not necessary that there exist any φ whose scope contains the entirety of every clique. Factor graphs explicitly represent the scope of each φ function. Specifically, a factor graph is a graphical representation of an undirected model that consists of a bipartite undirected graph. Some of the nodes are drawn as circles. These nodes correspond to random variables as in a standard undirected model. The rest of the nodes are drawn as squares. These nodes correspond to the factors φ of the unnormalized probability distribution. Variables and factors may be connected with undirected edges. A variable and a
factor are connected in the graph if and only if the variable is one of the arguments to the factor in the unnormalized probability distribution. No factor may be connected to another factor in the graph, nor can a variable be connected to a
variable. See Fig. 16.13 for an example of how factor graphs can resolve ambiguity in the interpretation of undirected networks.
Figure 16.13: An example of how a factor graph can resolve ambiguity in the interpretation of undirected networks. (Left) An undirected network with a clique involving three variables: a, b and c. (Center) A factor graph corresponding to the same undirected model. This factor graph has one factor over all three variables. (Right) Another valid factor graph for the same undirected model. This factor graph has three factors, each over only two variables. Representation, inference, and learning are all asymptotically cheaper for the (Right) factor graph than for the (Center) one, even though both require the same undirected graph to represent.
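One way to see the extra information a factor graph carries is to store it as a plain bipartite structure. This toy sketch (names are illustrative, not from the book) reproduces the two factor graphs of Fig. 16.13, which share the same undirected graph over a, b, c but differ in their variable-factor edges:

```python
def factor_graph(factors):
    """factors: dict mapping factor name -> tuple of its argument variables.
    Returns the variable-factor edges of the bipartite factor graph."""
    return {(var, f) for f, scope in factors.items() for var in scope}

# Two factor graphs with the same underlying undirected graph over a, b, c:
one_big = factor_graph({"f1": ("a", "b", "c")})          # one ternary factor
pairwise = factor_graph({"f1": ("a", "b"), "f2": ("b", "c"),
                         "f3": ("a", "c")})              # three pairwise factors
print(len(one_big), len(pairwise))
```

The two structures are distinct as bipartite graphs (3 edges versus 6), even though both induce the same clique {a, b, c} in the undirected representation.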
16.3 Sampling from Graphical Models
Graphical models also facilitate the task of drawing samples from a model.

One advantage of directed graphical models is that a simple and efficient procedure called ancestral sampling can produce a sample from the joint distribution represented by the model.

The basic idea is to sort the variables x_i in the graph into a topological ordering, so that for all i and j, j is greater than i if x_i is a parent of x_j. The variables can then be sampled in this order. In other words, we first sample x_1 ∼ P(x_1), then sample P(x_2 | Pa_G(x_2)), and so on, until finally we sample P(x_n | Pa_G(x_n)). So long as each conditional distribution p(x_i | Pa_G(x_i)) is easy to sample from, then the whole model is easy to sample from. The topological sorting operation guarantees that we can read the conditional distributions in Eq. 16.1 and sample from them in order. Without the topological sorting, we might attempt to sample a variable before its parents are available.

For some graphs, more than one topological ordering is possible. Ancestral sampling may be used with any of these topological orderings.

Ancestral sampling is generally very fast (assuming sampling from each conditional is easy) and convenient.
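A minimal sketch of ancestral sampling, with illustrative names not taken from the book: each node's conditional distribution is represented as a function from already-sampled parent values to a sample, and nodes are visited in a topological ordering.

```python
import random

def ancestral_sample(topo_order, parents, conditionals, rng):
    """Draw one joint sample from a directed model.
    topo_order: nodes listed so parents always precede children.
    conditionals: node -> function(parent_values, rng) returning a sample."""
    sample = {}
    for x in topo_order:
        parent_values = {p: sample[p] for p in parents[x]}  # already available
        sample[x] = conditionals[x](parent_values, rng)
    return sample

# A toy chain a -> b: P(a = 1) = 0.5, and P(b = 1 | a) is 0.9 if a else 0.1.
rng = random.Random(0)
s = ancestral_sample(
    topo_order=["a", "b"],
    parents={"a": [], "b": ["a"]},
    conditionals={
        "a": lambda pv, r: int(r.random() < 0.5),
        "b": lambda pv, r: int(r.random() < (0.9 if pv["a"] else 0.1)),
    },
    rng=rng,
)
print(s)
```

A single pass over the ordering yields a fair sample from the joint distribution, which is exactly what makes this procedure cheap when each conditional is easy to sample.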
One drawback to ancestral sampling is that it only applies to directed graphical models. Another drawback is that it does not support every conditional sampling operation. When we wish to sample from a subset of the variables in a directed graphical model, given some other variables, we often require that all the conditioning variables come earlier than the variables to be sampled in the ordered graph. In this case, we can sample from the local conditional probability distributions specified by the model distribution. Otherwise, the conditional distributions we need to sample from are the posterior distributions given the observed variables. These posterior distributions are usually not explicitly specified and parametrized in the model. Inferring these posterior distributions can be costly. In models where this is the case, ancestral sampling is no longer efficient.

Unfortunately, ancestral sampling is only applicable to directed models. We
can sample from undirected models by converting them to directed models, but this often requires solving intractable inference problems (to determine the marginal distribution over the root nodes of the new directed graph) or requires introducing so many edges that the resulting directed model becomes intractable. Sampling from an undirected model without first converting it to a directed model seems to require resolving cyclical dependencies. Every variable interacts with every other variable, so there is no clear beginning point for the sampling process. Unfortunately, drawing samples from an undirected graphical model is an expensive, multi-pass process. The conceptually simplest approach is Gibbs sampling. Suppose we
have a graphical model over an n-dimensional vector of random variables x. We iteratively visit each variable x_i and draw a sample conditioned on all of the other variables, from p(x_i | x_−i). Due to the separation properties of the graphical model, we can equivalently condition on only the neighbors of x_i. Unfortunately, after we have made one pass through the graphical model and sampled all n variables, we still do not have a fair sample from p(x). Instead, we must repeat the process and resample all n variables using the updated values of their neighbors. Asymptotically, after many repetitions, this process converges to sampling from the correct distribution. It can be difficult to determine when the samples have reached a sufficiently accurate approximation of the desired distribution. Sampling techniques for undirected models are an advanced topic, covered in more detail in Chapter 17.
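A hedged sketch of Gibbs sampling on a toy undirected model, assuming a binary pairwise (Ising-style) parameterization that is not from the book: spins x_i in {−1, +1} with a factor exp(J x_i x_j) on each edge. Each sweep resamples every variable conditioned only on its neighbors.

```python
import math
import random

def gibbs_sweep(state, neighbors, J, rng):
    """One pass: resample every variable given the current values of its
    neighbors (separation lets us ignore all the other variables)."""
    for i in state:
        field = J * sum(state[j] for j in neighbors[i])
        p_up = 1.0 / (1.0 + math.exp(-2.0 * field))  # P(x_i = +1 | neighbors)
        state[i] = 1 if rng.random() < p_up else -1
    return state

rng = random.Random(0)
state = {i: 1 for i in range(4)}                        # a ring of 4 spins
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
for _ in range(100):   # one sweep is not a fair sample; many are needed
    gibbs_sweep(state, neighbors, J=0.5, rng=rng)
print(state)
```

Note the multi-pass character: the loop of 100 sweeps reflects the fact that the chain only converges to the model distribution asymptotically, and knowing when it is "close enough" is itself difficult.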
16.4 Advantages of Structured Modeling
The primary advantage of using structured probabilistic models is that they allow us to dramatically reduce the cost of representing probability distributions as well
as learning and inference. Sampling is also accelerated in the case of directed models, while the situation can be complicated with undirected models. The primary mechanism that allows all of these operations to use less runtime and memory is choosing to not model certain interactions. Graphical models convey information by leaving edges out. Anywhere there is not an edge, the model specifies the assumption that we do not need to model a direct interaction.

A less quantifiable benefit of using structured probabilistic models is that they allow us to explicitly separate representation of knowledge from learning of knowledge or inference given existing knowledge. This makes our models easier to develop and debug. We can design, analyze, and evaluate learning algorithms and inference algorithms that are applicable to broad classes of graphs. Independently,
we can design models that capture the relationships we believe are important in our data. We can then combine these different algorithms and structures and obtain a Cartesian product of different possibilities. It would be much more difficult to design end-to-end algorithms for every possible situation.
16.5 Learning about Dependencies
A good generative model needs to accurately capture the distribution over the observed or "visible" variables v. Often the different elements of v are highly dependent on each other. In the context of deep learning, the approach most commonly used to model these dependencies is to introduce several latent or "hidden" variables, h. The model can then capture dependencies between any pair of variables v_i and v_j indirectly, via direct dependencies between v_i and h, and direct dependencies between h and v_j.

A good model of v which did not contain any latent variables would need to have very large numbers of parents per node in a Bayesian network or very large cliques in a Markov network. Just representing these higher order interactions is costly, both in a computational sense, because the number of parameters that
Just representing these higher order must be stored in memory scales exp exponentially onentially with the num umb b er of interactions members in is a costly—b because the number of bparameters that clique, butoth alsoininaacomputational statistical sense,sense, because this exp exponential onential num umb er of parameters must be astored in of memory exponentially the number of members in a requires wealth data toscales estimate accurately accurately.with . clique, but also in a statistical sense, because this exponential number of parameters When the mo model del is in intended tended to capture dep dependencies endencies betw etween een visible variables requires a wealth of data to estimate accurately. with direct connections, it is usually infeasible to connect all variables, so the graph When the model intended to capture depthat endencies between visible and variables must be designed toisconnect those variables are tightly coupled omit with direct connections, it is usually infeasible to connect all v ariables, so the graph edges bet etw ween other variables. An en entire tire field of machine learning called structur structuree m ust b e designed to connect those v ariables that are tightly coupled and le learning arning is devoted to this problem For a go goo od reference on structure learning,omit see edges b et w een other v ariables. An en tire field of machine learning called structur (Koller and Friedman, 2009). Most structure learning techniques are a form ofe learning is devoted to this problem For a good reference on structure learning, see 585 (Koller and Friedman, 2009). Most structure learning techniques are a form of
greedy search. A structure is proposed, a model with that structure is trained, then given a score. The score rewards high training set accuracy and penalizes model complexity. Candidate structures with a small number of edges added or removed are then proposed as the next step of the search. The search proceeds to a new structure that is expected to increase the score.

Using latent variables instead of adaptive structure avoids the need to perform discrete searches and multiple rounds of training. A fixed structure over visible and hidden variables can use direct interactions between visible and hidden units to impose indirect interactions between visible units. Using simple parameter learning techniques we can learn a model with a fixed structure that imputes the right structure on the marginal p(v).

Latent variables have advantages beyond their role in efficiently capturing p(v). The new variables h also provide an alternative representation for v.
For example, as discussed in Sec. 3.9.6, the mixture of Gaussians model learns a latent variable that corresponds to which category of examples the input was drawn from. This means that the latent variable in a mixture of Gaussians model can be used to do classification. In Chapter 14 we saw how simple probabilistic models like sparse coding learn latent variables that can be used as input features for a classifier, or as coordinates along a manifold. Other models can be used in this same way, but deeper models and models with different kinds of interactions can create even richer descriptions of the input. Many approaches accomplish feature learning by learning latent variables. Often, given some model of v and h, experimental observations show that E[h | v] or argmax_h p(h, v) is a good feature mapping for v.
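As a small worked example of E[h | v] as a feature mapping, consider a one-dimensional mixture of two Gaussians with made-up parameters (everything below is illustrative, not from the book). The posterior responsibilities p(h = k | v) form the feature vector, and their argmax acts as a classifier:

```python
import math

def responsibilities(v, weights, means, stds):
    """Posterior p(h = k | v) for each component k of a 1-D Gaussian mixture."""
    dens = [w * math.exp(-0.5 * ((v - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
            for w, m, s in zip(weights, means, stds)]
    z = sum(dens)                      # unnormalized p(v)
    return [d / z for d in dens]

r = responsibilities(v=-1.9, weights=[0.5, 0.5], means=[-2.0, 2.0], stds=[1.0, 1.0])
print(r)   # nearly all posterior mass on the first component
```

For an input near the first component's mean, the responsibility vector concentrates on that component, so thresholding or taking the argmax of E[h | v] classifies the input.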
16.6 Inference and Approximate Inference
One of the main ways we can use a probabilistic model is to ask questions about how variables are related to each other. Given a set of medical tests, we can ask what disease a patient might have. In a latent variable model, we might want to extract features E[h | v] describing the observed variables v. Sometimes we need to solve such problems in order to perform other tasks. We often train our models using the principle of maximum likelihood. Because

log p(v) = E_{h∼p(h|v)} [log p(h, v) − log p(h | v)],    (16.9)

we often want to compute p(h | v) in order to implement a learning rule. (Eq. 16.9 follows from the product rule: p(h, v) = p(h | v) p(v) implies log p(v) = log p(h, v) − log p(h | v) for every value of h, and the left side is unchanged by taking the expectation over h ∼ p(h | v).) All of these are examples of inference problems in which we must predict the value of
some variables given other variables, or predict the probability distribution over some variables given the value of other variables.

Unfortunately, for most interesting deep models, these inference problems are intractable, even when we use a structured graphical model to simplify them. The graph structure allows us to represent complicated, high-dimensional distributions with a reasonable number of parameters, but the graphs used for deep learning are usually not restrictive enough to also allow efficient inference.

It is straightforward to see that computing the marginal probability of a general graphical model is #P hard. The complexity class #P is a generalization of the complexity class NP. Problems in NP require determining only whether a problem has a solution and finding a solution if one exists. Problems in #P require counting the number of solutions. To construct a worst-case graphical model, imagine that
Problems in #P require coun we define a graphical mo model del ov over er the binary variables in a 3-SA 3-SAT T problem. ting We the nimp umbose er of solutions.distribution To construct aerworst-case graphical del,then imagine can impose a uniform ov over these variables. Wemo can add that one w e define a graphical over that the binary variables in eac a 3-SA T problem. We binary laten latent t variablemo perdel clause indicates whether each h clause is satisfied. can imp ose a uniform distribution ov er these v ariables. W e can then add one We can then add another latent variable indicating whether all of the clauses are binary laten t vcan ariable per clause indicates h clause aisreduction satisfied. satisfied. This be done withoutthat making a largewhether clique, eac by building W e can then add anotherwith latent variable indicating of the clauses are tree of latent variables, each no node de in the tree whether rep reporting ortingallwhether two other Thissatisfied. can be done making a large by building a reduction vsatisfied. ariables are Thewithout leav leaves es of this tree are clique, the variables for eac each h clause. tree of latent v ariables, with each no de in the tree rep orting whether t w The ro root ot of the tree rep reports orts whether the entire problem is satisfied. Dueotoother the vuniform ariablesdistribution are satisfied. The leav es of this tree are the v ariables for eac h clause. over the literals, the marginal distribution ov over er the ro root ot of the The root oftree thesp tree repwhat orts whether entire problem is satisfied. Due to the reduction specifies ecifies fraction the of assignments satisfy the problem. While uniform distribution over the example, literals, the distribution over thein ropractical ot of the this is a con contriv triv trived ed worst-case NPmarginal hard graphs commonly arise reduction specifies what fraction of assignments satisfy the problem. While real-w real-world orld tree scenarios. 
this is a contrived worst-case example, NP hard graphs commonly arise in practical This motiv motivates ates the use of appro approximate ximate inference. In the con context text of deep real-world scenarios. learning, this usually refers to variational inference, in which we approximate the motivates of appro inference. In the qcon of is deep | v )use v ) that p( hthe (h|text trueThis distribution by seeking anximate approximate distribution as learning, this usually refers to v ariational inference, in which we approximate the close to the true one as possible. This and other techniques are describ described ed in depth h v h v p ( q ( true distribution ) by seeking an approximate distribution ) that is as in Chapter 19. close to the true one as | possible. This and other techniques are describ | ed in depth in Chapter 19.
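As a quick sanity check on Eq. 16.9, the identity follows from the chain rule of probability in a few lines (a sketch, using the same notation as above):

```latex
% Chain rule: p(h, v) = p(h \mid v)\, p(v), so for any h with p(h \mid v) > 0,
\log p(v) = \log p(h, v) - \log p(h \mid v).
% The right-hand side takes the same value for every such h, so taking the
% expectation with respect to h \sim p(h \mid v) changes nothing:
\log p(v) = \mathbb{E}_{h \sim p(h \mid v)}\left[\log p(h, v) - \log p(h \mid v)\right].
```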
16.7 The Deep Learning Approach to Structured Probabilistic Models

Deep learning practitioners generally use the same basic computational tools as other machine learning practitioners who work with structured probabilistic models. However, in the context of deep learning, we usually make different design decisions about how to combine these tools, resulting in overall algorithms and models that
have a very different flavor from more traditional graphical models.

Deep learning does not always involve especially deep graphical models. In the context of graphical models, we can define the depth of a model in terms of the graphical model graph rather than the computational graph. We can think of a latent variable h_i as being at depth j if the shortest path from h_i to an observed variable is j steps. We usually describe the depth of the model as being the greatest depth of any such h_i. This kind of depth is different from the depth induced by the computational graph. Many generative models used for deep learning have no latent variables or only one layer of latent variables, but use deep computational graphs to define the conditional distributions within a model.

Deep learning essentially always makes use of the idea of distributed representations.
Even shallow models used for deep learning purposes (such as pretraining shallow models that will later be composed to form deep ones) nearly always have a single, large layer of latent variables. Deep learning models typically have more latent variables than observed variables. Complicated nonlinear interactions between variables are accomplished via indirect connections that flow through multiple latent variables.

By contrast, traditional graphical models usually contain mostly variables that are at least occasionally observed, even if many of the variables are missing at random from some training examples. Traditional models mostly use higher-order terms and structure learning to capture complicated nonlinear interactions between variables. If there are latent variables, they are usually few in number.

The way that latent variables are designed also differs in deep learning. The deep learning practitioner typically does not intend for the latent variables to take on any specific semantics ahead of time; the training algorithm is free to invent the concepts it needs to model a particular dataset. The latent variables are usually not very easy for a human to interpret after the fact, though visualization techniques may allow some rough characterization of what they represent. When latent variables are used in the context of traditional graphical models, they are often designed with some specific semantics in mind: the topic of a document, the intelligence of a student, the disease causing a patient's symptoms, and so on. These models are often much more interpretable by human practitioners and often have more theoretical guarantees, yet are less able to scale to complex problems and are not reusable in as many different contexts as deep models.

Another obvious difference is the kind of connectivity typically used in the deep learning approach. Deep graphical models typically have large groups of units that are all connected to other groups of units, so that the interactions between two groups may be described by a single matrix. Traditional graphical models
have very few connections and the choice of connections for each variable may be individually designed. The design of the model structure is tightly linked with the choice of inference algorithm. Traditional approaches to graphical models typically aim to maintain the tractability of exact inference. When this constraint is too limiting, a popular approximate inference algorithm is an algorithm called loopy belief propagation. Both of these approaches often work well with very sparsely connected graphs. By comparison, models used in deep learning tend to connect each visible unit v_i to very many hidden units h_j, so that h can provide a distributed representation of v_i (and probably several other observed variables too). Distributed representations have many advantages, but from the point of view
of graphical models and computational complexity, distributed representations have the disadvantage of usually yielding graphs that are not sparse enough for the traditional techniques of exact inference and loopy belief propagation to be relevant. As a consequence, one of the most striking differences between the larger graphical models community and the deep graphical models community is that loopy belief propagation is almost never used for deep learning. Most deep models are instead designed to make Gibbs sampling or variational inference algorithms efficient. Another consideration is that deep learning models contain a very large number of latent variables, making efficient numerical code essential.
This provides an additional motivation, besides the choice of high-level inference algorithm, for grouping the units into layers with a matrix describing the interaction between two layers. This allows the individual steps of the algorithm to be implemented with efficient matrix product operations, or sparsely connected generalizations, like block diagonal matrix products or convolutions.

Finally, the deep learning approach to graphical modeling is characterized by a marked tolerance of the unknown. Rather than simplifying the model until all quantities we might want can be computed exactly, we increase the power of the model until it is just barely possible to train or use. We often use models whose marginal distributions cannot be computed, and are satisfied simply to draw approximate samples from these models.
We often train models with an intractable objective function that we cannot even approximate in a reasonable amount of time, but we are still able to approximately train the model if we can efficiently obtain an estimate of the gradient of such a function. The deep learning approach is often to figure out what the minimum amount of information we absolutely need is, and then to figure out how to get a reasonable approximation of that information as quickly as possible.
Figure 16.14: An RBM drawn as a Markov network.
16.7.1 Example: The Restricted Boltzmann Machine
The restricted Boltzmann machine (RBM) (Smolensky, 1986) or harmonium is the quintessential example of how graphical models are used for deep learning. The RBM is not itself a deep model. Instead, it has a single layer of latent variables that may be used to learn a representation for the input. In Chapter 20, we will see how RBMs can be used to build many deeper models. Here, we show how the RBM exemplifies many of the practices used in a wide variety of deep graphical models: its units are organized into large groups called layers, the connectivity between layers is described by a matrix, the connectivity is relatively dense, the model is designed to allow efficient Gibbs sampling, and the emphasis of the model design is on freeing the training algorithm to learn latent variables whose semantics were not specified by the designer.
Later, in Sec. 20.2, we will revisit the RBM in more detail.

The canonical RBM is an energy-based model with binary visible and hidden units. Its energy function is

E(v, h) = −b^T v − c^T h − v^T W h,    (16.10)

where b, c, and W are unconstrained, real-valued, learnable parameters. We can see that the model is divided into two groups of units, v and h, and the interaction between them is described by a matrix W. The model is depicted graphically in Fig. 16.14. As this figure makes clear, an important aspect of this model is that there are no direct interactions between any two visible units or between any two hidden units (hence the "restricted"; a general Boltzmann machine may have arbitrary connections).

The restrictions on the RBM structure yield the nice properties

p(h | v) = ∏_i p(h_i | v)    (16.11)

and
p(v | h) = ∏_i p(v_i | h).    (16.12)
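To make Eq. 16.10 concrete, here is a minimal NumPy sketch of the RBM energy function. The names (`rbm_energy`, `v`, `h`, `b`, `c`, `W`) are ours, chosen to mirror the notation above; this is an illustration, not the book's code.

```python
import numpy as np

def rbm_energy(v, h, b, c, W):
    """Energy of a binary RBM (Eq. 16.10): E(v, h) = -b'v - c'h - v'Wh."""
    return -b @ v - c @ h - v @ W @ h

# A tiny configuration matching Fig. 16.14: 3 visible units, 4 hidden units.
rng = np.random.default_rng(0)
b = rng.standard_normal(3)          # visible biases
c = rng.standard_normal(4)          # hidden biases
W = rng.standard_normal((3, 4))     # visible-hidden interaction matrix
v = np.array([1.0, 0.0, 1.0])       # a binary visible configuration
h = np.array([0.0, 1.0, 1.0, 0.0])  # a binary hidden configuration
print(rbm_energy(v, h, b, c, W))
```

Because the energy is linear in each parameter, lower energy corresponds to higher unnormalized probability exp(−E(v, h)).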
Figure 16.15: Samples from a trained RBM, and its weights. Image reproduced with permission from LISA (2008). (Left) Samples from a model trained on MNIST, drawn using Gibbs sampling. Each column is a separate Gibbs sampling process. Each row represents the output of another 1,000 steps of Gibbs sampling. Successive samples are highly correlated with one another. (Right) The corresponding weight vectors. Compare this to the samples and weights of a linear factor model, shown in Fig. 13.2. The samples here are much better because the RBM prior p(h) is not constrained to be factorial. The RBM can learn which features should appear together when sampling. On the other hand, the RBM posterior p(h | v) is factorial, while the sparse coding posterior p(h | v) is not, so the sparse coding model may be better for feature extraction. Other models are able to have both a non-factorial p(h) and a non-factorial p(h | v).
The individual conditionals are simple to compute as| well. For the binary RBM
we obtain

P(h_i = 1 | v) = σ(v^T W_{:,i} + c_i),    (16.13)

P(h_i = 0 | v) = 1 − σ(v^T W_{:,i} + c_i),    (16.14)

where c_i is the hidden bias from Eq. 16.10 and σ is the logistic sigmoid. Together these properties allow for efficient block Gibbs sampling, which alternates between sampling all of h simultaneously and sampling all of v simultaneously. Samples generated by Gibbs sampling from an RBM model are shown in Fig. 16.15.

Since the energy function itself is just a linear function of the parameters, it is easy to take derivatives of the energy function. For example,

∂E(v, h)/∂W_{i,j} = −v_i h_j.    (16.15)

These two properties, efficient Gibbs sampling and efficient derivatives, make training convenient. In Chapter 18, we will see that undirected models may be trained by computing such derivatives applied to samples from the model.

Training the model induces a representation h of the data v.
We can often use E_{h∼p(h|v)}[h] as a set of features to describe v.
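The block Gibbs updates described above can be sketched in NumPy as follows. This is a toy illustration, not the book's implementation: it assumes the parametrization of Eq. 16.10, with c as the hidden biases and b as the visible biases.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def block_gibbs_step(v, b, c, W, rng):
    """One alternating sweep of block Gibbs sampling in a binary RBM.

    Given v, the hidden units are conditionally independent (Eq. 16.11),
    so the entire vector h is sampled at once; likewise for v given h.
    """
    p_h = sigmoid(c + v @ W)   # P(h_j = 1 | v), cf. Eq. 16.13
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + W @ h)   # P(v_i = 1 | h), by the symmetry of the energy
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

rng = np.random.default_rng(0)
b, c = np.zeros(6), np.zeros(4)
W = 0.01 * rng.standard_normal((6, 4))
v = (rng.random(6) < 0.5).astype(float)
for _ in range(1000):          # burn-in; successive samples are highly correlated
    v, h = block_gibbs_step(v, b, c, W, rng)
# The energy gradient with respect to W at a sample is simply
# -np.outer(v, h), cf. Eq. 16.15.
```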
Overall, the RBM demonstrates the typical deep learning approach to graphical models: representation learning accomplished via layers of latent variables, combined with efficient interactions between layers parametrized by matrices.

The language of graphical models provides an elegant, flexible, and clear way of describing probabilistic models. In the chapters ahead, we use this language, among other perspectives, to describe a wide variety of deep probabilistic models.
Chapter 17
Monte Carlo Methods

Randomized algorithms fall into two rough categories: Las Vegas algorithms and Monte Carlo algorithms. Las Vegas algorithms always return precisely the correct answer (or report that they failed). These algorithms consume a random amount of resources, usually memory or time. In contrast, Monte Carlo algorithms return answers with a random amount of error. The amount of error can typically be reduced by expending more resources (usually running time and memory). For any fixed computational budget, a Monte Carlo algorithm can provide an approximate answer.

Many problems in machine learning are so difficult that we can never expect to obtain precise answers to them. This excludes precise deterministic algorithms and Las Vegas algorithms. Instead, we must use deterministic approximate algorithms or Monte Carlo approximations.
Both approaches are ubiquitous in machine learning. In this chapter, we focus on Monte Carlo methods.
17.1 Sampling and Monte Carlo Methods
Many important technologies used to accomplish machine learning goals are based on drawing samples from some probability distribution and using these samples to form a Monte Carlo estimate of some desired quantity.
17.1.1 Why Sampling?
There are many reasons that we may wish to draw samples from a probability distribution. Sampling provides a flexible way to approximate many sums and
integrals at reduced cost. Sometimes we use this to provide a significant speedup to a costly but tractable sum, as in the case when we subsample the full training cost with minibatches. In other cases, our learning algorithm requires us to approximate an intractable sum or integral, such as the gradient of the log partition function of an undirected model. In many other cases, sampling is actually our goal, in the sense that we want to train a model that can sample from the training distribution.
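The minibatch case can be illustrated with a toy NumPy sketch: subsampling a large sum gives an unbiased Monte Carlo estimate of the full average. The data and sizes here are synthetic, chosen arbitrarily for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-example losses for a large synthetic "dataset". The full training cost
# is an average over all of them.
losses = rng.standard_normal(1_000_000) ** 2
full_cost = losses.mean()                  # exact, but touches every example

# A minibatch subsample gives an unbiased Monte Carlo estimate of the same
# average at a tiny fraction of the cost.
batch = rng.choice(losses, size=1_000, replace=False)
minibatch_cost = batch.mean()

print(full_cost, minibatch_cost)           # the two values are typically close
```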
17.1.2 Basics of Monte Carlo Sampling
When a sum or an integral cannot be computed exactly (for example, the sum has an exponential number of terms and no exact simplification is known), it is often possible to approximate it using Monte Carlo sampling. The idea is to view the sum or integral as if it were an expectation under some distribution and to approximate the expectation by a corresponding average. Let

s = ∑_x p(x) f(x) = E_p[f(x)]    (17.1)

or

s = ∫ p(x) f(x) dx = E_p[f(x)]    (17.2)

be the sum or integral to estimate, rewritten as an expectation, with the constraint that p is a probability distribution (for the sum) or a probability density (for the integral) over random variable x.

We can approximate s by drawing n samples x^{(1)}, …, x^{(n)} from p and then forming the empirical average

ŝ_n = (1/n) ∑_{i=1}^{n} f(x^{(i)}).    (17.3)

This approximation is justified by a few different properties. The first trivial observation is that the estimator ŝ_n is unbiased, since

E[ŝ_n] = (1/n) ∑_{i=1}^{n} E[f(x^{(i)})] = (1/n) ∑_{i=1}^{n} s = s.    (17.4)

But in addition, the law of large numbers states that if the samples x^{(i)} are i.i.d., then the average converges almost surely to the expected value:

lim_{n→∞} ŝ_n = s,    (17.5)
CHAPTER 17. MONTE CARLO METHODS
provided that the variance of the individual terms, Var[f(x^{(i)})], is bounded. To see this more clearly, consider the variance of \hat{s}_n as n increases. The variance Var[\hat{s}_n] decreases and converges to 0, so long as Var[f(x^{(i)})] < \infty:

    Var[\hat{s}_n] = \frac{1}{n^2} \sum_{i=1}^n Var[f(x)]             (17.6)

                   = \frac{Var[f(x)]}{n}.                             (17.7)

This convenient result also tells us how to estimate the uncertainty in a Monte Carlo average, or equivalently the amount of expected error of the Monte Carlo approximation. We compute both the empirical average of the f(x^{(i)}) and their empirical variance,¹ and then divide the estimated variance by the number of samples n to obtain an estimator of Var[\hat{s}_n]. The central limit theorem tells us that the distribution of the average, \hat{s}_n, converges to a normal distribution with mean s and variance \frac{Var[f(x)]}{n}. This allows us to estimate confidence intervals around the estimate \hat{s}_n, using the cumulative distribution of the normal density.
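As a concrete illustration of Eqs. 17.3 through 17.7, the following sketch estimates an expectation by an empirical average and uses the empirical variance to form a standard error. The function name and the particular choice of p and f are illustrative assumptions, not from the book:

```python
# Sketch of the basic Monte Carlo estimator (Eqs. 17.3-17.7).
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_estimate(f, sampler, n):
    """Return (s_hat, standard_error) for s = E_p[f(x)],
    where sampler(n) draws n i.i.d. samples from p."""
    fx = f(sampler(n))
    s_hat = fx.mean()                   # empirical average, Eq. 17.3
    var_hat = fx.var(ddof=1)            # unbiased empirical variance (divides by n - 1)
    return s_hat, np.sqrt(var_hat / n)  # square root of Var[s_hat], Eq. 17.7

# Toy example: under a standard normal, E[x^2] is exactly 1.
s_hat, se = monte_carlo_estimate(lambda x: x**2, rng.standard_normal, 100_000)
# By the central limit theorem, s_hat +/- 1.96 * se gives a ~95% confidence interval.
```

With 100,000 samples the standard error is small, so the interval is tight around the true value of 1.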
However, all this relies on our ability to easily sample from the base distribution p(x), but doing so is not always possible. When it is not feasible to sample from p, an alternative is to use importance sampling, presented in Sec. 17.2. A more general approach is to form a sequence of estimators that converge towards the distribution of interest. That is the approach of Monte Carlo Markov chains (Sec. 17.3).
17.2 Importance Sampling
An important step in the decomposition of the integrand (or summand) used by the Monte Carlo method in Eq. 17.2 is deciding which part of the integrand should play the role of the probability p(x) and which part of the integrand should play the role of the quantity f(x) whose expected value (under that probability distribution) is to be estimated. There is no unique decomposition, because p(x)f(x) can always be rewritten as

    p(x) f(x) = q(x) \frac{p(x) f(x)}{q(x)},                          (17.8)

where we now sample from q and average \frac{p(x) f(x)}{q(x)}. In many cases, we wish to compute an expectation for a given p and an f, and the fact that the problem is specified

¹ The unbiased estimator of the variance is often preferred, in which the sum of squared differences is divided by n − 1 instead of n.
from the start as an expectation suggests that this p and f would be a natural choice of decomposition. However, the original specification of the problem may not be the optimal choice in terms of the number of samples required to obtain a given level of accuracy. Fortunately, the form of the optimal choice q^* can be derived easily. The optimal q^* corresponds to what is called optimal importance sampling.

Because of the identity shown in Eq. 17.8, any Monte Carlo estimator

    \hat{s}_p = \frac{1}{n} \sum_{i=1, x^{(i)} \sim p}^n f(x^{(i)})   (17.9)

can be transformed into an importance sampling estimator

    \hat{s}_q = \frac{1}{n} \sum_{i=1, x^{(i)} \sim q}^n \frac{p(x^{(i)}) f(x^{(i)})}{q(x^{(i)})}.   (17.10)

We see readily that the expected value of the estimator does not depend on q:
    E_q[\hat{s}_q] = E_p[\hat{s}_p] = s.                              (17.11)

However, the variance of an importance sampling estimator can be greatly sensitive to the choice of q. The variance is given by

    Var[\hat{s}_q] = Var\left[ \frac{p(x) f(x)}{q(x)} \right] / n.    (17.12)

The minimum variance occurs when q is

    q^*(x) = \frac{p(x) |f(x)|}{Z},                                   (17.13)

where Z is the normalization constant, chosen so that q^*(x) sums or integrates to 1 as appropriate. Better importance sampling distributions put more weight where the integrand is larger. In fact, when f(x) does not change sign, Var[\hat{s}_{q^*}] = 0, meaning that a single sample is sufficient when the optimal distribution is used. Of course, this is only because the computation of q^* has essentially solved the original problem, so it is usually not practical to use this approach of drawing a single sample from the optimal distribution.

Any choice of sampling distribution q is valid (in the sense of yielding the correct expected value) and q^* is the optimal one (in the sense of yielding minimum variance). Sampling from q^* is usually infeasible, but other choices of q can be feasible while still reducing the variance somewhat.
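To make the estimator in Eq. 17.10 concrete, here is a small sketch in which the particular p, q, and f are illustrative assumptions: the target p is a standard normal, f(x) = x² (so the true value is s = E_p[x²] = 1), and the proposal q is a wider normal:

```python
# Importance sampling estimator of Eq. 17.10: draw x^(i) ~ q, then
# average f(x^(i)) weighted by p(x^(i)) / q(x^(i)).
import numpy as np

rng = np.random.default_rng(1)

def p_density(x):                 # target p: standard normal N(0, 1)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q_density(x):                 # proposal q: wider normal N(0, 4)
    return np.exp(-x**2 / 8) / (2 * np.sqrt(2 * np.pi))

f = lambda x: x**2                # true value: s = E_p[x^2] = 1

n = 200_000
x = 2.0 * rng.standard_normal(n)                         # x^(i) ~ q = N(0, 4)
s_hat_q = np.mean(p_density(x) * f(x) / q_density(x))    # Eq. 17.10
```

The expected value of `s_hat_q` does not depend on q (Eq. 17.11), but its variance does (Eq. 17.12): a q that places more mass where p(x)|f(x)| is large yields a lower-variance estimator.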
Another approach is to use biased importance sampling, which has the advantage of not requiring normalized p or q. In the case of discrete variables, the biased importance sampling estimator is given by

    \hat{s}_{BIS} = \frac{ \sum_{i=1}^n \frac{p(x^{(i)})}{q(x^{(i)})} f(x^{(i)}) }{ \sum_{i=1}^n \frac{p(x^{(i)})}{q(x^{(i)})} }           (17.14)

                  = \frac{ \sum_{i=1}^n \frac{p(x^{(i)})}{\tilde{q}(x^{(i)})} f(x^{(i)}) }{ \sum_{i=1}^n \frac{p(x^{(i)})}{\tilde{q}(x^{(i)})} }   (17.15)

                  = \frac{ \sum_{i=1}^n \frac{\tilde{p}(x^{(i)})}{\tilde{q}(x^{(i)})} f(x^{(i)}) }{ \sum_{i=1}^n \frac{\tilde{p}(x^{(i)})}{\tilde{q}(x^{(i)})} },   (17.16)
where \tilde{p} and \tilde{q} are the unnormalized forms of p and q and the x^{(i)} are the samples from q. This estimator is biased because E[\hat{s}_{BIS}] \neq s, except asymptotically when n \to \infty and the denominator of Eq. 17.14 converges to 1. Hence this estimator is called asymptotically unbiased.

Although a good choice of q can greatly improve the efficiency of Monte Carlo estimation, a poor choice of q can make the efficiency much worse. Going back to Eq. 17.12, we see that if there are samples of q for which \frac{p(x)|f(x)|}{q(x)} is large, then the variance of the estimator can get very large. This may happen when q(x) is tiny while neither p(x) nor f(x) is small enough to cancel it. The q distribution is usually chosen to be a very simple distribution so that it is easy to sample from. When x is high-dimensional, this simplicity in q causes it to match p or p|f| poorly. When q(x^{(i)}) \gg p(x^{(i)}) |f(x^{(i)})|, importance sampling collects useless samples (summing tiny numbers or zeros).
On the other hand, when q(x^{(i)}) \ll p(x^{(i)}) |f(x^{(i)})|, which will happen more rarely, the ratio can be huge. Because these latter events are rare, they may not show up in a typical sample, yielding typical underestimation of s, compensated only rarely by gross overestimation. Such very large or very small numbers are typical when x is high dimensional, because in high dimension the dynamic range of joint probabilities can be very large.

In spite of this danger, importance sampling and its variants have been found very useful in many machine learning algorithms, including deep learning algorithms. For example, see the use of importance sampling to accelerate training in neural language models with a large vocabulary (Sec. 12.4.3.3) or other neural nets with a large number of outputs. See also how importance sampling has been used to estimate a partition function (the normalization constant of a probability
distribution) in Sec. 18.7, and to estimate the log-likelihood in deep directed models such as the variational autoencoder, in Sec. 20.10.3. Importance sampling may also be used to improve the estimate of the gradient of the cost function used to train model parameters with stochastic gradient descent, particularly for models such as classifiers, where most of the total value of the cost function comes from a small number of misclassified examples. Sampling more difficult examples more frequently can reduce the variance of the gradient in such cases (Hinton, 2006).
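As a concrete illustration of the biased (self-normalized) estimator in Eq. 17.16, the following sketch uses only unnormalized densities; the specific \tilde{p}, \tilde{q}, and f are illustrative assumptions, not from the book:

```python
# Biased importance sampling (Eq. 17.16): the unknown normalization
# constants of p and q cancel in the ratio of the two weighted sums.
import numpy as np

rng = np.random.default_rng(2)

def p_tilde(x):                   # unnormalized target: N(0, 1) up to a constant
    return np.exp(-x**2 / 2)

def q_tilde(x):                   # unnormalized proposal: N(0, 4) up to a constant
    return np.exp(-x**2 / 8)

f = lambda x: x**2                # true value under p is E_p[x^2] = 1

x = 2.0 * rng.standard_normal(200_000)        # x^(i) ~ q = N(0, 4)
w = p_tilde(x) / q_tilde(x)                   # unnormalized importance weights
s_hat_bis = np.sum(w * f(x)) / np.sum(w)      # self-normalized average, Eq. 17.16
```

The estimator is biased for any finite n, but as the sample size grows the normalized weights concentrate around their expectation and the estimate approaches the true value.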
17.3 Markov Chain Monte Carlo Methods
In many cases, we wish to use a Monte Carlo technique but there is no tractable method for drawing exact samples from the distribution p_{model}(x) or from a good (low variance) importance sampling distribution q(x). In the context of deep learning, this most often happens when p_{model}(x) is represented by an undirected model. In these cases, we introduce a mathematical tool called a Markov chain to approximately sample from p_{model}(x). The family of algorithms that use Markov chains to perform Monte Carlo estimates is called Markov chain Monte Carlo (MCMC) methods. Markov chain Monte Carlo methods for machine learning are described at greater length in Koller and Friedman (2009). The most standard, generic guarantees for MCMC techniques are only applicable when the model does not assign zero probability to any state. Therefore, it is most convenient to present these techniques as sampling from an energy-based model (EBM)
it is most convenien exp (−E (x )) as pdo (xes) ∝ describ described ed in Sec. 16.2.4.Therefore, In the EBM form formulation, ulation, everyt to presen t these tecto hniques as sampling from an energy-based mo del state is guaranteed ha hav ve non-zero probability probability. . MCMC methods are (EBM) in fact exp ( E ( x )) p ( x ) as describ ed in Sec. 16.2.4 . In the EBM form ulation, more broadly applicable and can b e used with many probabilit probability y distributionsevery that state is guaranteed to ha v e non-zero probability . MCMC methods are in ∝ − con contain tain zero probability states. Ho How wev ever, er, the theoretical guaran guarantees tees concerning fact the more broadly applicable and b ebused with many probabilit distributions that b eha ehavior vior of MCMC metho methods dscan must e prov proven en on a case-b case-by-case y-casey basis for different contain zero probability states.InHothe wevcontext er, the theoretical guarantees the families of such distributions. of deep learning, it is concerning most common b eha vioronof the MCMC ds must b e prov en on tees a case-b basis apply for different to rely mostmetho general theoretical guaran guarantees thaty-case naturally to all families of such distributions. In the context of deep learning, it is most common energy-based mo models. dels. to rely on the most general theoretical guarantees that naturally apply to all To understand why dra drawing wing samples from an energy-based mo model del is difficult, energy-based mo dels. consider an EBM ov over er just two variables, defining a distribution p(a, b). In order T o understand why drawing samples an in energy-based mo delb is difficult, b ), and p(a | from to sample a, we must draw a from order to sample , we must consider an EBM ov er just t w o v ariables, defining a distribution p ( a , b ) . In order dra draw w it from p(b | a). It seems to b e an intractable chic chick ken-and-egg problem. 
p(a | b), and in order to sample b, we must draw it from p(b | a). It seems to be an intractable chicken-and-egg problem. Directed models avoid this because their graph is directed and acyclic. To perform ancestral sampling, one simply samples each of the variables in topological order, conditioning on each variable's parents, which are guaranteed to have already been sampled (Sec. 16.3). Ancestral sampling defines an efficient, single-pass method of
obtaining a sample.

In an EBM, we can avoid this chicken-and-egg problem by sampling using a Markov chain. The core idea of a Markov chain is to have a state x that begins as an arbitrary value. Over time, we randomly update x repeatedly. Eventually x becomes (very nearly) a fair sample from p(x). Formally, a Markov chain is defined by a random state x and a transition distribution T(x' | x) specifying the probability that a random update will go to state x' if it starts in state x. Running the Markov chain means repeatedly updating the state x to a value x' sampled from T(x' | x).

To gain some theoretical understanding of how MCMC methods work, it is useful to reparametrize the problem. First, we restrict our attention to the case where the random variable x has countably many states. In this case, we can represent the state as just a positive integer x. Different integer values of x map back to different states x in the original problem.
Consider what happens when we run infinitely many Markov chains in parallel. All of the states of the different Markov chains are drawn from some distribution q^{(t)}(x), where t indicates the number of time steps that have elapsed. At the beginning, q^{(0)} is some distribution that we used to arbitrarily initialize x for each Markov chain. Later, q^{(t)} is influenced by all of the Markov chain steps that have run so far. Our goal is for q^{(t)}(x) to converge to p(x).

Because we have reparametrized the problem in terms of a positive integer x, we can describe the probability distribution q using a vector v, with

    q(x = i) = v_i.                                                   (17.17)

Consider what happens when we update a single Markov chain's state x to a new state x'. The probability of a single state landing in state x' is given by

    q^{(t+1)}(x') = \sum_x q^{(t)}(x) T(x' | x).                      (17.18)
Using our integer parametrization, we can represent the effect of the transition operator T using a matrix A. We define A so that

    A_{i,j} = T(x' = i | x = j).                                      (17.19)
Using this definition, we can now rewrite Eq. 17.18. Rather than writing it in terms of q and T to understand how a single state is updated, we may now use v and A to describe how the entire distribution over all the different Markov chains run in parallel shifts as we apply an update:

    v^{(t)} = A v^{(t-1)}.                                            (17.20)
Applying the Markov chain update repeatedly corresponds to multiplying by the matrix A repeatedly. In other words, we can think of the process as exponentiating the matrix A:

    v^{(t)} = A^t v^{(0)}.                                            (17.21)

The matrix A has special structure because each of its columns represents a probability distribution. Such matrices are called stochastic matrices. If there is a non-zero probability of transitioning from any state x to any other state x' for some power t, then the Perron-Frobenius theorem (Perron, 1907; Frobenius, 1908) guarantees that the largest eigenvalue is real and equal to 1. Over time, we can see that all of the eigenvalues are exponentiated:

    v^{(t)} = (V \mathrm{diag}(\lambda) V^{-1})^t v^{(0)} = V \mathrm{diag}(\lambda)^t V^{-1} v^{(0)}.   (17.22)
This process causes all of the eigenvalues that are not equal to 1 to decay to zero. Under some additional mild conditions, A is guaranteed to have only one eigenvector with eigenvalue 1. The process thus converges to a stationary distribution, sometimes also called the equilibrium distribution. At convergence,

    v' = A v = v,                                                     (17.23)
and this same condition holds for every additional step. This is an eigenvector equation. To be a stationary point, v must be an eigenvector with corresponding eigenvalue 1. This condition guarantees that once we have reached the stationary distribution, repeated applications of the transition sampling procedure do not change the distribution over the states of all the various Markov chains (although the transition operator does change each individual state, of course).

If we have chosen T correctly, then the stationary distribution q will be equal to the distribution p we wish to sample from. We will describe how to choose T shortly, in Sec. 17.4.

Most properties of Markov chains with countable states can be generalized to continuous variables.
In this situation, some authors call the Markov chain a Harris chain, but we use the term Markov chain to describe both conditions. In general, a Markov chain with transition operator T will converge, under mild conditions, to a fixed point described by the equation

    q'(x') = E_{x \sim q} T(x' | x),                                  (17.24)

which in the discrete case is just rewriting Eq. 17.23. When x is discrete, the expectation corresponds to a sum, and when x is continuous, the expectation corresponds to an integral.
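The matrix view of Eqs. 17.17 through 17.23 can be checked on a toy chain. The following sketch (a three-state example of my own, not from the book) applies the update v ← Av repeatedly and verifies that the result is a fixed point of A:

```python
# Repeatedly applying a column-stochastic transition matrix A (Eq. 17.20)
# drives the distribution v toward the stationary distribution (Eq. 17.23).
import numpy as np

# Column j holds the transition distribution T(x' | x = j); columns sum to 1.
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.6, 0.3],
              [0.2, 0.2, 0.4]])

v = np.array([1.0, 0.0, 0.0])     # q^(0): start deterministically in state 0
for _ in range(100):              # Eq. 17.21: v^(t) = A^t v^(0)
    v = A @ v

# At convergence, v is an eigenvector of A with eigenvalue 1: Av = v.
assert np.allclose(A @ v, v)
```

Because this chain has non-zero probability of reaching every state, the same stationary distribution is reached regardless of the initial vector v.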
Regardless of whether the state is continuous or discrete, all Markov chain methods consist of repeatedly applying stochastic updates until eventually the state begins to yield samples from the equilibrium distribution. Running the Markov chain until it reaches its equilibrium distribution is called "burning in" the Markov chain. After the chain has reached equilibrium, a sequence of infinitely many samples may be drawn from the equilibrium distribution. They are identically distributed, but any two successive samples will be highly correlated with each other. A finite sequence of samples may thus not be very representative of the equilibrium distribution. One way to mitigate this problem is to return only every n samples, so that our estimate of the statistics of the equilibrium distribution is not as biased by the correlation between an MCMC sample and the next several samples.
Markov chains are thus expensive to use because of the time required to burn in to the equilibrium distribution and the time required to transition from one sample to another reasonably decorrelated sample after reaching equilibrium. If one desires truly independent samples, one can run multiple Markov chains in parallel. This approach uses extra parallel computation to eliminate latency. The strategy of using only a single Markov chain to generate all samples and the strategy of using one Markov chain for each desired sample are two extremes; deep learning practitioners usually use a number of chains that is similar to the number of examples in a minibatch and then draw as many samples as are needed from this fixed set of Markov chains. A commonly used number of Markov chains is 100.

Another difficulty is that we do not know in advance how many steps the Markov chain must run before reaching its equilibrium distribution. This length of time is called the mixing time. It is also very difficult to test whether a Markov chain has reached equilibrium. We do not have a precise enough theory for guiding us in answering this question. Theory tells us that the chain will converge, but not much more. If we analyze the Markov chain from the point of view of a matrix A acting on a vector of probabilities v, then we know that the chain mixes when A^t has effectively lost all of the eigenvalues of A besides the unique eigenvalue of 1. This means that the magnitude of the second largest eigenvalue will determine the mixing time. However, in practice, we cannot actually represent our Markov chain in terms of a matrix. The number of states that our probabilistic model can visit is exponentially large in the number of variables, so it is infeasible to represent v, A, or the eigenvalues of A.
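For a chain small enough to write down as a matrix, the spectral picture above can be checked directly. The helper below (our own toy sketch, not from the text) computes the magnitude of the second largest eigenvalue, which governs how quickly A^t forgets the initial distribution:

```python
import numpy as np

def second_eigenvalue_magnitude(T):
    # Sort eigenvalue magnitudes in decreasing order; for a stochastic
    # matrix the largest is the eigenvalue 1, so index 1 is the one
    # that controls the mixing time.
    mags = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    return mags[1]

# Hypothetical chains: the first forgets its starting state in one step,
# the second retains it for a long time.
fast_chain = np.array([[0.5, 0.5],
                       [0.5, 0.5]])    # second eigenvalue magnitude 0
slow_chain = np.array([[0.99, 0.01],
                       [0.01, 0.99]])  # second eigenvalue magnitude 0.98
```

A second eigenvalue magnitude near 1 means the chain needs on the order of 1/(1 − |λ₂|) steps to mix, which is exactly why the nearly-diagonal chain above is slow.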
Due to these and other obstacles, we usually do not know whether a Markov chain has mixed. Instead, we simply run the Markov chain for an amount of time that we roughly estimate to be sufficient, and use heuristic methods to determine whether the chain has mixed. These heuristic methods include manually inspecting samples or measuring correlations between successive samples.
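One such heuristic can be sketched in a few lines: estimate the lag-k autocorrelation of some scalar statistic of the chain; values near zero suggest adequate mixing, while values near ±1 are a warning sign. This is a generic illustration, not a procedure from the text:

```python
def autocorrelation(xs, lag):
    """Sample autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

# A strongly trending sequence is highly correlated at lag 1, while an
# alternating sequence is strongly anti-correlated.
trending = [float(i) for i in range(100)]
alternating = [1.0, -1.0] * 50
```

In practice one would compute this at several lags on a statistic of the sampled states and look for the autocorrelation to decay toward zero.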
17.4
Gibbs Sampling
So far we have described how to draw samples from a distribution q(x) by repeatedly updating x ← x' ∼ T(x' | x). However, we have not described how to ensure that q(x) is a useful distribution. Two basic approaches are considered in this book. The first one is to derive T from a given learned model p_model, described below with the case of sampling from EBMs. The second one is to directly parametrize T and learn it, so that its stationary distribution implicitly defines the p_model of interest. Examples of this second approach are discussed in Sec. 20.12 and Sec. 20.13.

In the context of deep learning, we commonly use Markov chains to draw samples from an energy-based model defining a distribution p_model(x). In this case, we want the q(x) for the Markov chain to be p_model(x). To obtain the desired q(x), we must choose an appropriate T(x' | x).

A conceptually simple and effective approach to building a Markov chain
that samples from p_model(x) is to use Gibbs sampling, in which sampling from T(x' | x) is accomplished by selecting one variable x_i and sampling it from p_model conditioned on its neighbors in the undirected graph G defining the structure of the energy-based model. It is also possible to sample several variables at the same time so long as they are conditionally independent given all of their neighbors. As shown in the RBM example in Sec. 16.7.1, all of the hidden units of an RBM may be sampled simultaneously because they are conditionally independent from each other given all of the visible units. Likewise, all of the visible units may be sampled simultaneously because they are conditionally independent from each other given all of the hidden units. Gibbs sampling approaches that update many variables simultaneously in this way are called block Gibbs sampling.
Alternate approaches to designing Markov chains to sample from p_model are possible. For example, the Metropolis-Hastings algorithm is widely used in other disciplines. In the context of the deep learning approach to undirected modeling, it is rare to use any approach other than Gibbs sampling. Improved sampling techniques are one possible research frontier.
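Block Gibbs sampling for an RBM can be sketched as follows. The weights and dimensions below are made up purely for illustration; the only point is that all hidden units are sampled at once given the visible units, and vice versa:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # hypothetical weights
b = np.zeros(n_visible)  # visible biases
c = np.zeros(n_hidden)   # hidden biases

v = rng.integers(0, 2, size=n_visible).astype(float)
for _ in range(100):
    # Block update: the h_j are conditionally independent given v.
    h = (rng.random(n_hidden) < sigmoid(c + v @ W)).astype(float)
    # Block update: the v_i are conditionally independent given h.
    v = (rng.random(n_visible) < sigmoid(b + W @ h)).astype(float)
```

Each pass performs only two sampling operations, regardless of the number of units, which is what makes block Gibbs sampling attractive for bipartite models like the RBM.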
17.5
The Challenge of Mixing between Separated Modes
The primary difficulty involved with MCMC methods is that they have a tendency to mix poorly. Ideally, successive samples from a Markov chain designed to sample from p(x) would be completely independent from each other and would visit many different regions in x space proportional to their probability. Instead, especially in high dimensional cases, MCMC samples become very correlated. We refer
to such behavior as slow mixing or even failure to mix. MCMC methods with slow mixing can be seen as inadvertently performing something resembling noisy gradient descent on the energy function, or equivalently noisy hill climbing on the probability, with respect to the state of the chain (the random variables being sampled). The chain tends to take small steps (in the state space of the Markov chain), from a configuration x^(t−1) to a configuration x^(t), with the energy E(x^(t)) generally lower than or approximately equal to the energy E(x^(t−1)), with a preference for moves that yield lower energy configurations. When starting from a rather improbable configuration (higher energy than the typical ones from p(x)), the chain tends to gradually reduce the energy of the state and only occasionally move to another mode. Once the chain has found a region of low energy (for example, if the variables are pixels in an image, a region of low energy might be
a connected manifold of images of the same object), which we call a mode, the chain will tend to walk around that mode (following a kind of random walk). Once in a while it will step out of that mode and generally return to it or (if it finds an escape route) move towards another mode. The problem is that successful escape routes are rare for many interesting distributions, so the Markov chain will continue to sample the same mode longer than it should.

This is very clear when we consider the Gibbs sampling algorithm (Sec. 17.4). In this context, consider the probability of going from one mode to a nearby mode within a given number of steps. What will determine that probability is the shape of the "energy barrier" between these modes. Transitions between two modes that are separated by a high energy barrier (a region of low probability) are exponentially less likely (in terms of the height of the energy barrier). This is illustrated in Fig. 17.1. The problem arises when there are multiple modes with high probability that are separated by regions of low probability, especially when each Gibbs sampling step must update only a small subset of variables whose values are largely determined by the other variables.

As a simple example, consider an energy-based model over two variables a and b, which are both binary with a sign, taking on values −1 and 1. If E(a, b) = −wab for some large positive number w, then the model expresses a strong belief that a and b have the same sign. Consider updating b using a Gibbs sampling step with a = 1. The conditional distribution over b is given by P(b = 1 | a = 1) = e^w / (e^w + e^{−w}) = σ(2w). If w is large, the sigmoid saturates, and the probability of also assigning b to be 1 is close to 1. Likewise, if a = −1, the probability of assigning b to be −1 is close to 1. According to P_model(a, b), both signs of the two variables are equally likely. According to P_model(a | b), both variables should have the same sign. This means that Gibbs sampling will only very rarely flip the signs of these variables.
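The two-variable example can be verified numerically. The sketch below (our own illustration, not code from the text) runs Gibbs sampling on E(a, b) = −wab, computing each conditional exactly, and counts how often a sign flips; a weak coupling flips constantly while a strong one almost never does:

```python
import math
import random

def gibbs_flip_count(w, steps=10000, seed=0):
    rng = random.Random(seed)
    a, b = 1, 1
    flips = 0
    for _ in range(steps):
        # Exact conditional: P(b = 1 | a) = exp(w*a) / (exp(w*a) + exp(-w*a)).
        p_b1 = math.exp(w * a) / (math.exp(w * a) + math.exp(-w * a))
        new_b = 1 if rng.random() < p_b1 else -1
        flips += (new_b != b)
        b = new_b
        # Update a given b, symmetrically.
        p_a1 = math.exp(w * b) / (math.exp(w * b) + math.exp(-w * b))
        new_a = 1 if rng.random() < p_a1 else -1
        flips += (new_a != a)
        a = new_a
    return flips
```

With w = 0.1 the chain flips thousands of times over 10,000 sweeps; with w = 5 it flips essentially never, even though the two aligned sign configurations are equally probable under the joint distribution.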
Figure 17.1: Paths followed by Gibbs sampling for three distributions, with the Markov chain initialized at the mode in all three cases. (Left) A multivariate normal distribution with two independent variables. Gibbs sampling mixes well because the variables are independent. (Center) A multivariate normal distribution with highly correlated variables. The correlation between variables makes it difficult for the Markov chain to mix. Because each variable must be updated conditioned on the other, the correlation reduces the rate at which the Markov chain can move away from the starting point. (Right) A mixture of Gaussians with widely separated modes that are not axis-aligned. Gibbs sampling mixes very slowly because it is difficult to change modes while altering only one variable at a time.
In more practical scenarios, the challenge is even greater because we care not only about making transitions between two modes but more generally between all the many modes that a real model might contain. If several such transitions are difficult because of the difficulty of mixing between modes, then it becomes very expensive to obtain a reliable set of samples covering most of the modes, and convergence of the chain to its stationary distribution is very slow.

Sometimes this problem can be resolved by finding groups of highly dependent units and updating all of them simultaneously in a block. Unfortunately, when the dependencies are complicated, it can be computationally intractable to draw a sample from the group. After all, the problem that the Markov chain was originally introduced to solve is this problem of sampling from a large group of variables.
In the context of models with latent variables, which define a joint distribution p_model(x, h), we often draw samples of x by alternating between sampling from p_model(x | h) and sampling from p_model(h | x). From the point of view of mixing rapidly, we would like p_model(h | x) to have very high entropy. However, from the point of view of learning a useful representation of x, we would like h to encode
Figure 17.2: An illustration of the slow mixing problem in deep probabilistic models. Each panel should be read left to right, top to bottom. (Left) Consecutive samples from Gibbs sampling applied to a deep Boltzmann machine trained on the MNIST dataset. Consecutive samples are similar to each other. Because the Gibbs sampling is performed in a deep graphical model, this similarity is based more on semantic rather than raw visual features, but it is still difficult for the Gibbs chain to transition from one mode of the distribution to another, for example by changing the digit identity. (Right) Consecutive ancestral samples from a generative adversarial network. Because ancestral sampling generates each sample independently from the others, there is no mixing problem.
enough information about x to reconstruct it well, which implies that h and x should have very high mutual information. These two goals are at odds with each other. We often learn generative models that very precisely encode x into h but are not able to mix very well. This situation arises frequently with Boltzmann machines: the sharper the distribution a Boltzmann machine learns, the harder it is for a Markov chain sampling from the model distribution to mix well. This problem is illustrated in Fig. 17.2.

All this could make MCMC methods less useful when the distribution of interest has a manifold structure with a separate manifold for each class: the distribution is concentrated around many modes and these modes are separated by vast regions of high energy. This type of distribution is what we expect in many classification problems, and it would make MCMC methods converge very slowly because of poor mixing between modes.
17.5.1
Tempering to Mix between Modes
When a distribution has sharp peaks of high probability surrounded by regions of low probability, it is difficult to mix between the different modes of the distribution.
Several techniques for faster mixing are based on constructing alternative versions of the target distribution in which the peaks are not as high and the surrounding valleys are not as low. Energy-based models provide a particularly simple way to do so. So far, we have described an energy-based model as defining a probability distribution

p(x) ∝ exp(−E(x)).    (17.25)

Energy-based models may be augmented with an extra parameter β controlling how sharply peaked the distribution is:

p_β(x) ∝ exp(−βE(x)).    (17.26)

The β parameter is often described as being the reciprocal of the temperature, reflecting the origin of energy-based models in statistical physics. When the temperature falls to zero and β rises to infinity, the energy-based model becomes deterministic. When the temperature rises to infinity and β falls to zero, the distribution (for discrete x) becomes uniform.

Typically, a model is trained to be evaluated at β = 1. However, we can make use of other temperatures, particularly those where β < 1. Tempering is a general strategy of mixing between modes of p_1 rapidly by drawing samples with β < 1. Markov chains based on tempered transitions (Neal, 1994) temporarily sample from higher-temperature distributions in order to mix to different modes, then resume sampling from the unit temperature distribution. These techniques have been applied to models such as RBMs (Salakhutdinov, 2010). Another approach is to use parallel tempering (Iba, 2001), in which the Markov chain simulates many different states in parallel, at different temperatures. The highest temperature states mix quickly, while the lowest temperature states, at temperature 1, provide accurate samples from the model.
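The effect of β in Eq. 17.26 is easy to see on a toy model. The sketch below (a made-up two-state example, not from the text) normalizes exp(−βE) at several temperatures: a small β flattens the distribution toward uniform, and a large β concentrates it on the lowest-energy state:

```python
import math

def tempered_distribution(energies, beta):
    weights = [math.exp(-beta * E) for E in energies]
    Z = sum(weights)                # partition function at this beta
    return [w / Z for w in weights]

energies = [0.0, 2.0]                              # state 0 has lower energy
hot = tempered_distribution(energies, beta=0.01)   # near-uniform
unit = tempered_distribution(energies, beta=1.0)   # the trained model
cold = tempered_distribution(energies, beta=10.0)  # near-deterministic
```

Sampling at small β lets the chain cross between modes easily, which is exactly what the tempering strategies above exploit before returning to β = 1.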
The transition operator includes stochastically swapping states between two different temperature levels, so that a sufficiently high-probability sample from a high-temperature slot can jump into a lower temperature slot. This approach has also been applied to RBMs (Desjardins et al., 2010; Cho et al., 2010). Although tempering is a promising approach, at this point it has not allowed researchers to make a strong advance in solving the challenge of sampling from complex EBMs. One possible reason is that there are critical temperatures around which the temperature transition must be very slow (as the temperature is gradually reduced) in order for tempering to be effective.
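A schematic version of parallel tempering, on a hypothetical one-dimensional double-well energy (all names and settings are our own illustration, not from the text): several chains run at different β values, and a stochastic swap move between adjacent temperature levels lets a state found by a hot chain migrate into a colder slot:

```python
import math
import random

def energy(x):
    # Hypothetical double-well energy with modes near -2 and +2.
    return (x * x - 4.0) ** 2

def parallel_tempering(betas, steps=5000, seed=0):
    rng = random.Random(seed)
    states = [0.0] * len(betas)
    for _ in range(steps):
        # Within-chain Metropolis update at each temperature.
        for i, beta in enumerate(betas):
            prop = states[i] + rng.gauss(0.0, 0.5)
            log_accept = -beta * (energy(prop) - energy(states[i]))
            if rng.random() < math.exp(min(0.0, log_accept)):
                states[i] = prop
        # Stochastic swap between a random adjacent pair of levels.
        i = rng.randrange(len(betas) - 1)
        log_accept = (betas[i] - betas[i + 1]) * (energy(states[i]) - energy(states[i + 1]))
        if rng.random() < math.exp(min(0.0, log_accept)):
            states[i], states[i + 1] = states[i + 1], states[i]
    return states

final = parallel_tempering(betas=[0.05, 0.3, 1.0])  # ordered hot to cold
```

The swap acceptance rule used here is the standard Metropolis criterion for exchanging configurations between inverse temperatures, which preserves the joint stationary distribution over all the levels.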
17.5.2 Depth May Help Mixing
When drawing samples from a latent variable model p(h, x), we have seen that if p(h | x) encodes x too well, then sampling from p(x | h) will not change x very
CHAPTER 17. MONTE CARLO METHODS
much and mixing will be poor. One way to resolve this problem is to make h be a deep representation, that encodes x into h in such a way that a Markov chain in the space of h can mix more easily. Many representation learning algorithms, such as autoencoders and RBMs, tend to yield a marginal distribution over h that is more uniform and more unimodal than the original data distribution over x. It can be argued that this arises from trying to minimize reconstruction error while using all of the available representation space, because minimizing reconstruction error over the training examples will be better achieved when different training examples are easily distinguishable from each other in h-space, and thus well separated. Bengio et al. (2013a) observed that deeper stacks of regularized autoencoders or RBMs yield marginal distributions in the top-level h-space that appeared more spread out and more uniform, with less of a gap between the regions corresponding to different modes (categories, in the experiments). Training an RBM in that higher-level space allowed Gibbs sampling to mix faster between modes. It remains however unclear how to exploit this observation to help better train and sample from deep generative models.

Despite the difficulty of mixing, Monte Carlo techniques are useful and are often the best tool available. Indeed, they are the primary tool used to confront the intractable partition function of undirected models, discussed next.
Chapter 18

Confronting the Partition Function

In Sec. 16.2.2 we saw that many probabilistic models (commonly known as undirected graphical models) are defined by an unnormalized probability distribution p̃(x; θ). We must normalize p̃ by dividing by a partition function Z(θ) in order to obtain a valid probability distribution:

p(x; θ) = (1/Z(θ)) p̃(x; θ).   (18.1)

The partition function is an integral (for continuous variables) or sum (for discrete variables) over the unnormalized probability of all states:

Z(θ) = ∫ p̃(x) dx   (18.2)

or

Z(θ) = Σ_x p̃(x).   (18.3)

This operation is intractable for many interesting models.

As we will see in Chapter 20, several deep learning models are designed to have a tractable normalizing constant, or are designed to be used in ways that do not involve computing p(x) at all. However, other models directly confront the challenge of intractable partition functions. In this chapter, we describe techniques used for training and evaluating models that have intractable partition functions.
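To make Eqs. 18.1-18.3 concrete, here is a sketch that computes Z by brute-force enumeration for a hypothetical log-linear model over three binary variables; such enumeration is exactly what becomes intractable as the number of variables grows, since the number of states doubles with each added variable.

```python
import itertools
import numpy as np

# Hypothetical unnormalized distribution over three binary variables:
# p_tilde(x; theta) = exp(theta . x), a tiny log-linear model.
theta = np.array([0.5, -1.0, 2.0])

def p_tilde(x):
    return np.exp(theta @ x)

# Eq. 18.3: Z(theta) is the sum of p_tilde over all 2^3 = 8 states.
states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=3)]
Z = sum(p_tilde(x) for x in states)

# Eq. 18.1: dividing by Z yields probabilities that sum to one.
p = np.array([p_tilde(x) / Z for x in states])
print(Z, p.sum())
```

For this particular factorized example Z also has the closed form Π_i (1 + exp(θ_i)), which the enumeration reproduces; most interesting models admit no such closed form.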
18.1 The Log-Likelihood Gradient

What makes learning undirected models by maximum likelihood particularly difficult is that the partition function depends on the parameters. The gradient of the log-likelihood with respect to the parameters has a term corresponding to the gradient of the partition function:

∇_θ log p(x; θ) = ∇_θ log p̃(x; θ) − ∇_θ log Z(θ).   (18.4)

This is a well-known decomposition into the positive phase and negative phase of learning.

For most undirected models of interest, the negative phase is difficult. Models with no latent variables or with few interactions between latent variables typically have a tractable positive phase. The quintessential example of a model with a straightforward positive phase and difficult negative phase is the RBM, which has hidden units that are conditionally independent from each other given the visible
positive is difficult, interactions teractions hidden that are conditionally endent from each en the fo visible b et etw weenunits laten latent t variables, is primarilyindep co cov vered in Chapter 19other . Thisgiv chapter focuses cuses units. The case where the p ositive phase is difficult, with complicated in teractions on the difficulties of the negative phase. between latent variables, is primarily covered in Chapter 19. This chapter focuses Let difficulties us lo look ok more closely at thephase. gradient of log Z : on the of the negative ∂ Let us look more closely at the gradient of log Z : log Z (18.5) ∂θ ∂ log (18.5) ∂ Z Z ∂ θ ∂θ (18.6) = Z PZ ∂ (18.6) = p˜(x) = ∂θ Zx (18.7) Z P ∂ p˜(x) = (18.7) Z p˜(x) . = x ∂θ (18.8) Z p˜(x) For mo models dels that guarantee p(x= all x. , we can substitute exp (log(18.8) ) > 0 for p˜(x)) PZ for p˜(x): For models that guarantee P 0 for p(x)∂>exp (logallp˜(xx,))we can substitute exp (log p˜(x)) x ∂θ (18.9) for p˜(x): P Z exp (log p˜(x)) P ∂ (18.9) exp (log Zp˜(x)) ∂θ log p˜(x) = x (18.10) Z exp (log p˜(x)) log p˜(x) ∂ = (18.10) p˜(x) ∂θ Z log p˜(x) = (18.11) P x Z p˜(x) log p˜(x) 609 =P (18.11) Z P
P
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
X
∂ log p˜(x) (18.12) ∂θ x ∂ = p(x) log p˜(x) (18.12) ∂θ = Ex∼p(x) log p˜(x). (18.13) ∂θ ∂ E log p˜er(xdiscrete ). x, but a similar(18.13) This deriv derivation ation made use = of summation over result ∂ θ ov X x applies using in integration tegration ov over er con contin tin tinuous uous . In the con contin tin tinuous uous version of the This deriv made use offor summation over under discrete , but a similar result deriv derivation, ation, weation use Leibniz’s rule differen differentiation tiation thexintegral sign to obtain applies using continuous x. In the continuous version of the the iden identit tit tity y integration over Z for differenZtiation under the integral sign to obtain derivation, we use Leibniz’s∂rule ∂ p˜(x)dx = p˜(x)dx. (18.14) the identity ∂θ ∂θ ∂ ∂ ∂ p˜(x)dcertain x = regularity p˜(x)dxconditions . (18.14) p˜(x). This identit identity y is only applicable under on p˜ and ∂θ ∂θ ∂θ In measure theoretic terms, the conditions are: (i) p˜ must be a Leb Lebesgue-integrable esgue-integrable p˜(x). This identit y is only applicable under certain and almost ∂ regularity conditions on p function of x for every value of θ ; (ii) ∂θ p˜(x) must exist for all θ˜ and p ˜ In measure theoretic terms, the conditions are: (i) m ust b e a Leb esgue-integrable ∂ Z integrable Zfunction R( x) that bounds ∂θ p˜( x) (i.e. all x; (iii) There must exist an x θ p ˜ ( x θ function of for every v alue of ; (ii) ) must exist for all and almost ∂ | ∂θ p˜(x)| ≤ R(x) for all θ and almost all x). Fortunately ortunately,, most machine learning all x; (iii) There must exist an integrable function R( x) that bounds p˜( x) (i.e. mo models dels of in interest terest ha hav ve these prop properties. erties. R(x) for all θ and almost all x). Fortunately, most machine learning p˜(x) This iden identit tit tity y mo | dels |of≤interest have these properties. 
(18.15) ∇θ log Z = E x∼p(x)∇θ log p˜(x) This identity E metho is the basis for a variet ariety y of Monte Carlo methods ds maximizing log Z = logfor p˜(xapproximately ) (18.15) the lik likeliho eliho elihoo od of mo models dels with intractable partition functions. ∇ Monte Carlo metho ∇ ds for approximately maximizing is the basis for a variety of The Mon Monte te Carlo approach to learning undirected mo models dels provides an in intuitiv tuitiv tuitivee the likelihood of models with intractable partition functions. framew framework ork in whic which h we can think of both the positive phase and the negativ negativee The Mon te Carlo approach to learning undirected mo dels provides an intuitiv phase. In the positive phase, we increase log p˜( x) for x dra drawn wn from the data. Ine framew ork inphase, which can think of both the positive phase and logthe p˜(xnegativ the negative wewe decrease the partition function by decreasing ) drawne phase.the Inmo thedel positive phase, we increase log p˜( x) for x drawn from the data. In from model distribution. the negative phase, we decrease the partition function by decreasing log p˜(x) drawn In the deep learning literature, it is common to parametrize log p˜ in terms of from the model distribution. an energy function (Eq. 16.7). In this case, we can interpret the positiv ositivee phase log p˜ in phase In the deep learning is common to parametrize terms as of as pushing down on the literature, energy of it training examples and the negative an energy (Eq. 16.7 ). In this case,from we can positive in phase pushing upfunction on the energy of samples drawn the interpret mo model, del, asthe illustrated Fig. as pushing down on the energy of training examples and the negative phase as 18.1 18.1.. pushing up on the energy of samples drawn from the model, as illustrated in Fig. 18.1. =
p(x)
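Eq. 18.15 can be checked numerically on any model small enough to enumerate. The sketch below uses a hypothetical log-linear model over three binary variables, for which ∇_θ log p̃(x) = x, and compares a finite-difference estimate of ∇_θ log Z with the model expectation of x:

```python
import itertools
import numpy as np

# Hypothetical model: log p_tilde(x; theta) = theta . x over binary x,
# so grad_theta log p_tilde(x) = x.
theta = np.array([0.3, -0.7, 1.1])
states = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)

def log_Z(th):
    return np.log(np.sum(np.exp(states @ th)))

# Right-hand side of Eq. 18.15: the model expectation of grad_theta log p_tilde.
p = np.exp(states @ theta - log_Z(theta))
expected_grad = p @ states

# Left-hand side: grad_theta log Z, estimated by central finite differences.
eps = 1e-6
numeric_grad = np.array([
    (log_Z(theta + eps * e) - log_Z(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(numeric_grad - expected_grad)))  # should be tiny
```

In an intractable model neither side can be computed exactly; Monte Carlo methods replace the expectation on the right with an average over samples from the model.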
18.2 Stochastic Maximum Likelihood and Contrastive Divergence

The naive way of implementing Eq. 18.15 is to compute it by burning in a set of Markov chains from a random initialization every time the gradient is needed.
When learning is performed using stochastic gradient descent, this means the chains must be burned in once per gradient step. This approach leads to the training procedure presented in Algorithm 18.1. The high cost of burning in the Markov chains in the inner loop makes this procedure computationally infeasible, but this procedure is the starting point that other more practical algorithms aim to approximate.

Algorithm 18.1 A naive MCMC algorithm for maximizing the log-likelihood with an intractable partition function using gradient ascent.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow burn in. Perhaps 100 to train an RBM on a small image patch.
while not converged do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
  g ← (1/m) Σ_{i=1}^m ∇_θ log p̃(x^(i); θ).
  Initialize a set of m samples {x̃^(1), ..., x̃^(m)} to random values (e.g., from a uniform or normal distribution, or possibly a distribution with marginals matched to the model's marginals).
  for i = 1 to k do
    for j = 1 to m do
      x̃^(j) ← gibbs_update(x̃^(j)).
    end for
  end for
  g ← g − (1/m) Σ_{i=1}^m ∇_θ log p̃(x̃^(i); θ).
  θ ← θ + ε g.
end while

We can view the MCMC approach to maximum likelihood as trying to achieve
balance between two forces, one pushing up on the model distribution where the data occurs, and another pushing down on the model distribution where the model samples occur. Fig. 18.1 illustrates this process. The two forces correspond to maximizing log p̃ and minimizing log Z. Several approximations to the negative phase are possible. Each of these approximations can be understood as making the negative phase computationally cheaper but also making it push down in the wrong locations.

Because the negative phase involves drawing samples from the model's distribution, we can think of it as finding points that the model believes in strongly. Because the negative phase acts to reduce the probability of those points, they are generally considered to represent the model's incorrect beliefs about the world.
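As a concrete and deliberately tiny illustration, the following sketch instantiates Algorithm 18.1 for a hypothetical log-linear model log p̃(x; θ) = θ · x over binary vectors. For this model ∇_θ log p̃(x) = x, and each unit's Gibbs conditional is Bernoulli(σ(θ_i)) independently of the other units, so burn-in is trivially fast; the loop keeps the shape of Algorithm 18.1 anyway. The data and hyperparameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Hypothetical model: log p_tilde(x; theta) = theta . x over binary x, so
# grad_theta log p_tilde(x) = x, and the Gibbs conditional for each unit is
# Bernoulli(sigmoid(theta_i)), independent of the other units.
def gibbs_update(x, theta):
    return (rng.random(x.shape) < sigmoid(theta)).astype(float)

# Synthetic training set with known marginals.
target = np.array([0.9, 0.2, 0.5])
data = (rng.random((1000, 3)) < target).astype(float)

theta = np.zeros(3)
eps, k, m = 0.1, 20, 100
for step in range(500):
    batch = data[rng.integers(len(data), size=m)]
    g = batch.mean(axis=0)                            # positive phase
    x_neg = (rng.random((m, 3)) < 0.5).astype(float)  # random init each step
    for _ in range(k):                                # burn in the chains
        x_neg = gibbs_update(x_neg, theta)
    g -= x_neg.mean(axis=0)                           # negative phase
    theta += eps * g
# Maximum likelihood drives the model marginals toward the data marginals.
print(sigmoid(theta), data.mean(axis=0))
```

The gradient is zero in expectation exactly when the model marginals match the data marginals, which is the balance of forces described above.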
[Two-panel figure: "The positive phase" and "The negative phase," each plotting p_model(x) and p_data(x) against x.]
Figure 18.1: The view of Algorithm 18.1 as having a "positive phase" and "negative phase." (Left) In the positive phase, we sample points from the data distribution, and push up on their unnormalized probability. This means points that are likely in the data get pushed up on more. (Right) In the negative phase, we sample points from the model distribution, and push down on their unnormalized probability. This counteracts the positive phase's tendency to just add a large constant to the unnormalized probability everywhere. When the data distribution and the model distribution are equal, the positive phase has the same chance to push up at a point as the negative phase has to push down. When this occurs, there is no longer any gradient (in expectation) and training must terminate.
They are frequently referred to in the literature as "hallucinations" or "fantasy particles." In fact, the negative phase has been proposed as a possible explanation for dreaming in humans and other animals (Crick and Mitchison, 1983), the idea being that the brain maintains a probabilistic model of the world and follows the gradient of log p̃ while experiencing real events while awake and follows the negative gradient of log p̃ to minimize log Z while sleeping and experiencing events sampled from the current model. This view explains much of the language used to describe algorithms with a positive and negative phase, but it has not been proven to be correct with neuroscientific experiments. In machine learning models, it is usually necessary to use the positive and negative phase simultaneously, rather than in separate time periods of wakefulness and REM sleep. As we will see in Sec. 19.5, other machine learning algorithms draw samples from the model distribution for other purposes and such algorithms could also provide an account for the function of dream sleep.

Given this understanding of the role of the positive and negative phase of learning, we can attempt to design a less expensive alternative to Algorithm 18.1. The main cost of the naive MCMC algorithm is the cost of burning in the Markov
chains from a random initialization at each step. A natural solution is to initialize the Markov chains from a distribution that is very close to the model distribution, so that the burn in operation does not take as many steps.

The contrastive divergence (CD, or CD-k to indicate CD with k Gibbs steps) algorithm initializes the Markov chain at each step with samples from the data distribution (Hinton, 2000, 2010). This approach is presented as Algorithm 18.2. Obtaining samples from the data distribution is free, because they are already available in the data set. Initially, the data distribution is not close to the model distribution, so the negative phase is not very accurate. Fortunately, the positive phase can still accurately increase the model's probability of the data. After the
positive phase has had some time to act, the model distribution is closer to the data distribution, and the negative phase starts to become accurate.

Algorithm 18.2 The contrastive divergence algorithm, using gradient ascent as the optimization procedure.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain sampling from p(x; θ) to mix when initialized from p_data. Perhaps 1-20 to train an RBM on a small image patch.
while not converged do
  Sample a minibatch of m examples {x^(1), ..., x^(m)} from the training set.
  g ← (1/m) Σ_{i=1}^m ∇_θ log p̃(x^(i); θ).
  for i = 1 to m do
    x̃^(i) ← x^(i).
  end for
  for i = 1 to k do
    for j = 1 to m do
      x̃^(j) ← gibbs_update(x̃^(j)).
    end for
  end for
  g ← g − (1/m) Σ_{i=1}^m ∇_θ log p̃(x̃^(i); θ).
  θ ← θ + ε g.
end while

Of course, CD is still an approximation to the correct negative phase. The
main way that CD qualitatively fails to implement the correct negative phase is that it fails to suppress regions of high probability that are far from actual training examples. These regions that have high probability under the model but low probability under the data generating distribution are called spurious modes.
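For comparison with Algorithm 18.2, here is a minimal CD-1 training loop for a toy binary RBM. The data (two repeated binary patterns), the layer sizes, and the hyperparameters are all invented for the example; the positive phase uses the exact hidden conditionals, and the negative phase takes a single Gibbs step starting from the data, which is exactly the initialization choice that gives rise to the spurious-mode problem described above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# A minimal binary RBM trained with CD-1 (hypothetical toy setup).
n_v, n_h, m = 6, 4, 50
W = 0.01 * rng.normal(size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)  # visible and hidden biases

# Toy data: two repeated binary patterns.
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(2, size=500)]

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

eps = 0.05
for step in range(2000):
    v0 = data[rng.integers(len(data), size=m)]
    ph0 = sigmoid(v0 @ W + c)           # positive phase statistics
    h0 = sample(ph0)
    v1 = sample(sigmoid(h0 @ W.T + b))  # one Gibbs step from the data (CD-1)
    ph1 = sigmoid(v1 @ W + c)
    W += eps * (v0.T @ ph0 - v1.T @ ph1) / m
    b += eps * (v0 - v1).mean(axis=0)
    c += eps * (ph0 - ph1).mean(axis=0)

# Reconstructions of the training patterns should be close to the patterns.
h = sigmoid(patterns @ W + c)
recon = sigmoid(h @ W.T + b)
print(np.round(recon, 2))
```

Because the chain runs only one step from the data, spurious modes far from the patterns would never be visited by the negative phase, which is the failure mode illustrated in Fig. 18.2.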
[Figure: p_model(x) and p_data(x) plotted against x, with one mode of p_model labeled "A spurious mode."]
Figure 18.2: An illustration of how the negative phase of contrastive divergence (Algorithm 18.2) can fail to suppress spurious modes. A spurious mode is a mode that is present in the model distribution but absent in the data distribution. Because contrastive divergence initializes its Markov chains from data points and runs the Markov chain for only a few steps, it is unlikely to visit modes in the model that are far from the data points. This means that when sampling from the model, we will sometimes get samples that do not resemble the data. It also means that due to wasting some of its probability mass on these modes, the model will struggle to place high probability mass on the correct modes. For the purpose of visualization, this figure uses a somewhat simplified concept of distance: the spurious mode is far from the correct mode along the number line in ℝ. This corresponds to a Markov chain based on making local moves with a single variable x ∈ ℝ. For most deep probabilistic models, the Markov chains are based on Gibbs sampling and can make non-local moves of individual variables but cannot move all of the variables simultaneously. For these problems, it is usually better to consider the edit distance between modes, rather than the Euclidean distance. However, edit distance in a high dimensional space is difficult to depict in a 2-D plot.
Fig. 18.2 illustrates why this happens. Essentially, it is because modes in the model distribution that are far from the data distribution will not be visited by Markov chains initialized at training points, unless k is very large.

Carreira-Perpiñan and Hinton (2005) showed experimentally that the CD estimator is biased for RBMs and fully visible Boltzmann machines, in that it converges to different points than the maximum likelihood estimator. They argue that because the bias is small, CD could be used as an inexpensive way to initialize a model that could later be fine-tuned via more expensive MCMC methods. Bengio and Delalleau (2009) showed that CD can be interpreted as discarding the smallest terms of the correct MCMC update gradient, which explains the bias.

CD is useful for training shallow models like RBMs. These can in turn be
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
stacked to initialize deeper models like DBNs or DBMs. However, CD does not provide much help for training deeper models directly. This is because it is difficult to obtain samples of the hidden units given samples of the visible units. Since the hidden units are not included in the data, initializing from training points cannot solve the problem. Even if we initialize the visible units from the data, we will still need to burn in a Markov chain sampling from the distribution over the hidden units conditioned on those visible samples.

The CD algorithm can be thought of as penalizing the model for having a Markov chain that changes the input rapidly when the input comes from the data. This means training with CD somewhat resembles autoencoder training. Even though CD is more biased than some of the other training methods, it can be useful for pretraining shallow models that will later be stacked.
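To make the procedure concrete, here is a minimal NumPy sketch of the CD-k update for a binary RBM. The helper names (`sample_h`, `cd_k_update`) and the specific parametrization are our own illustrative assumptions, not code from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, c):
    """Sample hidden units given visible units for a binary RBM."""
    p = sigmoid(v @ W + c)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v(h, W, b):
    """Sample visible units given hidden units."""
    p = sigmoid(h @ W.T + b)
    return (rng.random(p.shape) < p).astype(float), p

def cd_k_update(v_data, W, b, c, k=1, lr=0.01):
    """One CD-k gradient step: the negative chain starts at the data."""
    # Positive phase: hidden probabilities given the data.
    _, ph_data = sample_h(v_data, W, c)
    # Negative phase: k steps of Gibbs sampling, initialized at the data.
    v = v_data
    for _ in range(k):
        h, _ = sample_h(v, W, c)
        v, _ = sample_v(h, W, b)
    _, ph_model = sample_h(v, W, c)
    m = v_data.shape[0]
    W += lr * (v_data.T @ ph_data - v.T @ ph_model) / m
    b += lr * (v_data - v).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c
```

Because the negative chain is initialized at the data, modes far from the training points are rarely visited when k is small, which is the source of the spurious-mode problem discussed above.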
This is because the earliest models in the stack are encouraged to copy more information up to their latent variables, thereby making it available to the later models. This should be thought of more as an often-exploitable side effect of CD training than as a principled design advantage.

Sutskever and Tieleman (2010) showed that the CD update direction is not the gradient of any function. This allows for situations where CD could cycle forever, but in practice this is not a serious problem.

A different strategy that resolves many of the problems with CD is to initialize the Markov chains at each gradient step with their states from the previous gradient step. This approach was first discovered under the name stochastic maximum likelihood (SML) in the applied mathematics and statistics community (Younes,
1998) and later independently rediscovered under the name persistent contrastive divergence (PCD, or PCD-k to indicate the use of k Gibbs steps per update) in the deep learning community (Tieleman, 2008). See Algorithm 18.3. The basic idea of this approach is that, so long as the steps taken by the stochastic gradient algorithm are small, then the model from the previous step will be similar to the model from the current step. It follows that the samples from the previous model's distribution will be very close to being fair samples from the current model's distribution, so a Markov chain initialized with these samples will not require much time to mix.

Because each Markov chain is continually updated throughout the learning process, rather than restarted at each gradient step, the chains are free to wander far enough to find all of the model's modes.
SML is thus considerably more resistant to forming models with spurious modes than CD is. Moreover, because it is possible to store the state of all of the sampled variables, whether visible or latent, SML provides an initialization point for both the hidden and visible units.
CD is only able to provide an initialization for the visible units, and therefore requires burn-in for deep models. SML is able to train deep models efficiently.

Marlin et al. (2010) compared SML to many of the other criteria presented in this chapter. They found that SML results in the best test set log-likelihood for an RBM, and that if the RBM's hidden units are used as features for an SVM classifier, SML results in the best classification accuracy.

SML is vulnerable to becoming inaccurate if the stochastic gradient algorithm can move the model faster than the Markov chain can mix between steps. This can happen if k is too small or ε is too large. The permissible range of values is unfortunately highly problem-dependent. There is no known way to test formally whether the chain is successfully mixing between steps.
Subjectively, if the learning rate is too high for the number of Gibbs steps, the human operator will be able to observe that there is much more variance in the negative phase samples across gradient steps than across different Markov chains. For example, a model trained on MNIST might sample exclusively 7s on one step. The learning process will then push down strongly on the mode corresponding to 7s, and the model might sample exclusively 9s on the next step.

Algorithm 18.3 The stochastic maximum likelihood / persistent contrastive divergence algorithm using gradient ascent as the optimization procedure.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain sampling from p(x; θ + εg) to burn in, starting from samples from p(x; θ). Perhaps 1 for an RBM on a small image patch, or 5-50 for a more complicated model like a DBM.
Initialize a set of m samples {x̃^(1), . . . , x̃^(m)} to random values (e.g., from a uniform or normal distribution, or possibly a distribution with marginals matched to the model's marginals).
while not converged do
    Sample a minibatch of m examples {x^(1), . . . , x^(m)} from the training set.
    g ← (1/m) Σ_{i=1}^m ∇_θ log p̃(x^(i); θ).
    for i = 1 to k do
        for j = 1 to m do
            x̃^(j) ← gibbs_update(x̃^(j)).
        end for
    end for
    g ← g − (1/m) Σ_{i=1}^m ∇_θ log p̃(x̃^(i); θ).
    θ ← θ + εg.
end while
Care must be taken when evaluating the samples from a model trained with SML. It is necessary to draw the samples starting from a fresh Markov chain initialized from a random starting point after the model is done training. The samples present in the persistent negative chains used for training have been influenced by several recent versions of the model, and thus can make the model appear to have greater capacity than it actually does.

Berglund and Raiko (2013) performed experiments to examine the bias and variance in the estimate of the gradient provided by CD and SML. CD proves to have lower variance than the estimator based on exact sampling. SML has higher variance. The cause of CD's low variance is its use of the same training points in both the positive and negative phase. If the negative phase is initialized from different training points, the variance rises above that of the estimator based on exact sampling.
All of these methods based on using MCMC to draw samples from the model can in principle be used with almost any variant of MCMC. This means that techniques such as SML can be improved by using any of the enhanced MCMC techniques described in Chapter 17, such as parallel tempering (Desjardins et al., 2010; Cho et al., 2010).

One approach to accelerating mixing during learning relies not on changing the Monte Carlo sampling technology but rather on changing the parametrization of the model and the cost function. Fast PCD or FPCD (Tieleman and Hinton, 2009) involves replacing the parameters θ of a traditional model with an expression

θ = θ^(slow) + θ^(fast).    (18.16)

There are now twice as many parameters as before, and they are added together element-wise to provide the parameters used by the original model definition.
The fast copy of the parameters is trained with a much larger learning rate, allowing it to adapt rapidly in response to the negative phase of learning and push the Markov chain to new territory. This forces the Markov chain to mix rapidly, though this effect only occurs during learning while the fast weights are free to change. Typically one also applies significant weight decay to the fast weights, encouraging them to converge to small values, after only transiently taking on large values long enough to encourage the Markov chain to change modes.
One key benefit of the MCMC-based methods described in this section is that they provide an estimate of the gradient of log Z, and thus we can essentially decompose the problem into the log p̃ contribution and the log Z contribution. We can then use any other method to tackle log p̃(x), and just add our negative phase gradient onto the other method's gradient. In particular, this means that
our positive phase can make use of methods that provide only a lower bound on p̃. Most of the other methods of dealing with log Z presented in this chapter are incompatible with bound-based positive phase methods.
18.3 Pseudolikelihood
Monte Carlo approximations to the partition function and its gradient directly confront the partition function. Other approaches sidestep the issue, by training the model without computing the partition function. Most of these approaches are based on the observation that it is easy to compute ratios of probabilities in an undirected probabilistic model. This is because the partition function appears in both the numerator and the denominator of the ratio and cancels out:

p(x)/p(y) = ((1/Z) p̃(x)) / ((1/Z) p̃(y)) = p̃(x)/p̃(y).    (18.17)

The pseudolikelihood is based on the observation that conditional probabilities take this ratio-based form, and thus can be computed without knowledge of the partition function. Suppose that we partition x into a, b and c, where a contains the variables we want to find the conditional distribution over, b contains the variables we want to condition on, and c contains the variables that are not part of our query.
p(a | b) = p(a, b) / p(b) = p(a, b) / (Σ_{a,c} p(a, b, c)) = p̃(a, b) / (Σ_{a,c} p̃(a, b, c)).    (18.18)

This quantity requires marginalizing out a, which can be a very efficient operation provided that a and c do not contain very many variables. In the extreme case, a can be a single variable and c can be empty, making this operation require only as many evaluations of p̃ as there are values of a single random variable.

Unfortunately, in order to compute the log-likelihood, we need to marginalize out large sets of variables. If there are n variables total, we must marginalize a set of size n − 1. By the chain rule of probability,

log p(x) = log p(x_1) + log p(x_2 | x_1) + · · · + log p(x_n | x_{1:n−1}).    (18.19)

In this case, we have made a maximally small, but c can be as large as x_{2:n}. What if we simply move c into b to reduce the computational cost? This yields the
What pseudolikeliho pseudolikelihoo od (Besag, 1975) ob objectiv jectiv jectivee function, based on predicting the value of if we simply move c into b to reduce the computational cost? This yields the pseudolikelihood (Besag, 1975) ob jective618 function, based on predicting the value of
feature x_i given all of the other features x_{−i}:

Σ_{i=1}^n log p(x_i | x_{−i}).    (18.20)

If each random variable has k different values, this requires only k × n evaluations of p̃ to compute, as opposed to the k^n evaluations needed to compute the partition function.

This may look like an unprincipled hack, but it can be proven that estimation by maximizing the pseudolikelihood is asymptotically consistent (Mase, 1995). Of course, in the case of datasets that do not approach the large sample limit, pseudolikelihood may display different behavior from the maximum likelihood estimator.

It is possible to trade computational complexity for deviation from maximum likelihood behavior by using the generalized pseudolikelihood estimator (Huang and Ogata, 2002). The generalized pseudolikelihood estimator uses m different sets S^(i), i = 1, . . . , m of indices of variables that appear together on the left side of the conditioning bar. In the extreme case of m = 1 and S^(1) = {1, . . . , n} the generalized
pseudolikelihood recovers the log-likelihood. In the extreme case of m = n and S^(i) = {i}, the generalized pseudolikelihood recovers the pseudolikelihood. The generalized pseudolikelihood objective function is given by

Σ_{i=1}^m log p(x_{S^(i)} | x_{−S^(i)}).    (18.21)
The performance of pseudolikelihood-based approaches depends largely on how the model will be used. Pseudolikelihood tends to perform poorly on tasks that require a good model of the full joint p(x), such as density estimation and sampling. However, it can perform better than maximum likelihood for tasks that require only the conditional distributions used during training, such as filling in small amounts of missing values. Generalized pseudolikelihood techniques are especially powerful if the data has regular structure that allows the index sets S to be designed to capture the most important correlations while leaving out groups of variables that have only negligible correlation. For example, in natural images, pixels that are widely separated in space also have weak correlation, so the generalized pseudolikelihood can be applied with each S^(i) set being a small, spatially localized window.
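As a concrete illustration, here is a sketch of the pseudolikelihood objective (Eq. 18.20) for a tiny fully visible Boltzmann machine. Each conditional needs only k = 2 evaluations of p̃ per variable, and Z cancels in the ratio (Eq. 18.17), whereas the exact log-likelihood needs all 2^n terms of the partition function. The model and its parameter values are invented for illustration:

```python
import numpy as np
from itertools import product

# A tiny fully visible Boltzmann machine over x in {0,1}^n, with
# unnormalized log-probability log p~(x) = x @ J @ x / 2 + b @ x
# (J symmetric with zero diagonal; values chosen arbitrarily).
n = 4
rng = np.random.default_rng(0)
J = rng.normal(size=(n, n)); J = (J + J.T) / 2; np.fill_diagonal(J, 0)
b = rng.normal(size=n)

def log_ptilde(x):
    return 0.5 * x @ J @ x + b @ x

def pseudolikelihood(x):
    """Eq. 18.20: sum_i log p(x_i | x_{-i}); each conditional is a
    ratio of unnormalized probabilities, so Z cancels (Eq. 18.17)."""
    total = 0.0
    for i in range(n):
        logps = []
        for v in (0.0, 1.0):           # k = 2 evaluations per variable
            x2 = x.copy(); x2[i] = v
            logps.append(log_ptilde(x2))
        total += logps[int(x[i])] - np.logaddexp(logps[0], logps[1])
    return total

# The exact log-likelihood needs the partition function: k**n = 16 terms.
logZ = np.logaddexp.reduce([log_ptilde(np.array(s, dtype=float))
                            for s in product([0, 1], repeat=n)])
x = np.array([1.0, 0.0, 1.0, 1.0])
```

Here the pseudolikelihood costs 2 × 4 = 8 evaluations of p̃, versus 2^4 = 16 just for Z; in general the two objectives differ, coinciding only asymptotically.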
One weakness of the pseudolikelihood estimator is that it cannot be used with other approximations that provide only a lower bound on p̃(x), such as variational inference, which will be covered in Chapter 19. This is because p̃ appears in the
denominator. A lower bound on the denominator provides only an upper bound on the expression as a whole, and there is no benefit to maximizing an upper bound. This makes it difficult to apply pseudolikelihood approaches to deep models such as deep Boltzmann machines, since variational methods are one of the dominant approaches to approximately marginalizing out the many layers of hidden variables that interact with each other. However, pseudolikelihood is still useful for deep learning, because it can be used to train single layer models, or deep models using approximate inference methods that are not based on lower bounds.

Pseudolikelihood has a much greater cost per gradient step than SML, due to its explicit computation of all of the conditionals. However, generalized pseudolikelihood and similar criteria can still perform well if only one randomly selected
pseudoconditional is computed pof erall example Go Goo odfellow etHow al., ever, 2013bgeneralized ), thereb thereby y bringing lik eliho o d and similar criteria can still p erform w ell if only one randomly selected the computational cost do down wn to match that of SML. conditional is computed per example (Goodfellow et al., 2013b), thereby bringing Though the pseudolik pseudolikelihoo elihoo elihood d estimator do does es not explicitly minimize log Z , it the computational cost down to match that of SML. can still be though thoughtt of as having something resem resembling bling a negative phase. The Though the pseudolik elihoo d estimator do es not minimizealgorithm log Z , it denominators of eac each h conditional distribution resultexplicitly in the learning can still be the though t of as having something resem bling phase. from The suppressing probability of all states that hav have e only onea vnegative ariable differing of each conditional distribution result in the learning algorithm adenominators training example. suppressing the probability of all states that have only one variable differing from See Marlin and de Freitas (2011) for a theoretical analysis of the asymptotic a training example. efficiency of pseudolik pseudolikeliho eliho elihoo od. See Marlin and de Freitas (2011) for a theoretical analysis of the asymptotic efficiency of pseudolikelihood.
18.4 Score Matching and Ratio Matching
Score matching (Hyvärinen, 2005) provides another consistent means of training a model without estimating Z or its derivatives. The name score matching comes from terminology in which the derivatives of a log density with respect to its argument, ∇_x log p(x), are called its score. The strategy used by score matching is to minimize the expected squared difference between the derivatives of the model's log density with respect to the input and the derivatives of the data's log density with respect to the input:

L(x, θ) = (1/2) ||∇_x log p_model(x; θ) − ∇_x log p_data(x)||_2^2    (18.22)
J(θ) = (1/2) E_{p_data(x)} L(x, θ)    (18.23)
θ* = min_θ J(θ)    (18.24)

This objective function avoids the difficulties associated with differentiating the partition function Z because Z is not a function of x and therefore ∇_x Z = 0.
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
Initially, score matching appears to have a new difficulty: computing the score of the data distribution requires knowledge of the true distribution generating the training data, p_data. Fortunately, minimizing the expected value of L(x, θ) is equivalent to minimizing the expected value of

\tilde{L}(x, \theta) = \sum_{j=1}^{n} \left( \frac{\partial^2}{\partial x_j^2} \log p_{\text{model}}(x; \theta) + \frac{1}{2} \left( \frac{\partial}{\partial x_j} \log p_{\text{model}}(x; \theta) \right)^2 \right)    (18.25)

where n is the dimensionality of x.

Because score matching requires taking derivatives with respect to x, it is not applicable to models of discrete data. However, the latent variables in the model may be discrete.

Like the pseudolikelihood, score matching only works when we are able to evaluate log p̃(x) and its derivatives directly.
It is not compatible with methods that only provide a lower bound on log p̃(x), because score matching requires the derivatives and second derivatives of log p̃(x), and a lower bound conveys no information about its derivatives. This means that score matching cannot be applied to estimating models with complicated interactions between the hidden units, such as sparse coding models or deep Boltzmann machines. While score matching can be used to pretrain the first hidden layer of a larger model, it has not been applied as a pretraining strategy for the deeper layers of a larger model. This is probably because the hidden layers of such models usually contain some discrete variables.
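To make Eq. 18.25 concrete, here is a minimal NumPy sketch (not from the book) that fits a univariate Gaussian by gradient descent on the score matching objective; the model family, data, learning rate and iteration count are assumptions chosen for illustration. Note that no access to the true data score is needed, only samples.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=10_000)

def sm_objective(data, mu, t):
    # Eq. 18.25 for a Gaussian model with log p~(x) = -t (x - mu)^2 / 2,
    # parameterized by t = 1 / sigma^2:
    #   d/dx   log p~ = -t (x - mu)
    #   d2/dx2 log p~ = -t
    return np.mean(-t + 0.5 * t ** 2 * (data - mu) ** 2)

# Plain gradient descent on (mu, t); the partition function never appears.
mu, t = 0.0, 1.0
for _ in range(2000):
    grad_mu = -t ** 2 * np.mean(data - mu)
    grad_t = -1.0 + t * np.mean((data - mu) ** 2)
    mu -= 0.1 * grad_mu
    t -= 0.1 * grad_t

print(mu, 1.0 / t)  # close to the sample mean and variance
```

The stationary point of the objective recovers the sample mean and variance, matching the consistency property of score matching for this simple model family.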
While score matching does not explicitly have a negative phase, it can be viewed as a version of contrastive divergence using a specific kind of Markov chain (Hyvärinen, 2007a). The Markov chain in this case is not Gibbs sampling, but rather a different approach that makes local moves guided by the gradient. Score matching is equivalent to CD with this type of Markov chain when the size of the local moves approaches zero.

Lyu (2009) generalized score matching to the discrete case (but made an error in their derivation that was corrected by Marlin et al. (2010)). Marlin et al. (2010) found that generalized score matching (GSM) does not work in high dimensional discrete spaces where the observed probability of many events is 0.

A more successful approach to extending the basic ideas of score matching to discrete data is ratio matching (Hyvärinen, 2007b). Ratio matching applies specifically to binary data.
Ratio matching consists of minimizing the average over
examples of the following objective function:

L^{(RM)}(x, \theta) = \sum_{j=1}^{n} \left( \frac{1}{1 + \frac{p_{\text{model}}(x; \theta)}{p_{\text{model}}(f(x, j); \theta)}} \right)^2,    (18.26)

where f(x, j) returns x with the bit at position j flipped. Ratio matching avoids the partition function using the same trick as the pseudolikelihood estimator: in a ratio of two probabilities, the partition function cancels out. Marlin et al. (2010) found that ratio matching outperforms SML, pseudolikelihood and GSM in terms of the ability of models trained with ratio matching to denoise test set images.

Like the pseudolikelihood estimator, ratio matching requires n evaluations of p̃ per data point, making its computational cost per update roughly n times higher than that of SML.

As with the pseudolikelihood estimator, ratio matching can be thought of as pushing down on all fantasy states that have only one variable different from a training example.
Since ratio matching applies specifically to binary data, this means that it acts on all fantasy states within Hamming distance 1 of the data.

Ratio matching can also be useful as the basis for dealing with high-dimensional sparse data, such as word count vectors. This kind of data poses a challenge for MCMC-based methods because the data is extremely expensive to represent in dense format, yet the MCMC sampler does not yield sparse values until the model has learned to represent the sparsity in the data distribution. Dauphin and Bengio (2013) overcame this issue by designing an unbiased stochastic approximation to ratio matching. The approximation evaluates only a randomly selected subset of the terms of the objective, and does not require the model to generate complete fantasy samples.
See Marlin and de Freitas (2011) for a theoretical analysis of the asymptotic efficiency of ratio matching.
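As an illustration of Eq. 18.26 (an assumption-laden sketch, not the book's experiment), the following computes the ratio matching objective for a small binary model whose unnormalized log-probability is an Ising-style energy chosen for the example; only ratios of p̃ are needed, so Z cancels exactly as in pseudolikelihood.

```python
import numpy as np

def log_p_tilde(x, W, b):
    # Unnormalized log-probability of a small binary model (an Ising-style
    # energy chosen for illustration): log p~(x) = x^T W x / 2 + b^T x.
    return 0.5 * x @ W @ x + b @ x

def ratio_matching_loss(x, W, b):
    # Eq. 18.26: the ratio p_model(x) / p_model(f(x, j)) equals
    # p~(x) / p~(f(x, j)) because the partition function cancels.
    loss = 0.0
    for j in range(len(x)):
        x_flip = x.copy()
        x_flip[j] = 1 - x_flip[j]
        log_ratio = log_p_tilde(x, W, b) - log_p_tilde(x_flip, W, b)
        loss += 1.0 / (1.0 + np.exp(log_ratio)) ** 2
    return loss

rng = np.random.default_rng(0)
n = 5
W = rng.normal(size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)
x = rng.integers(0, 2, size=n)
print(ratio_matching_loss(x, W, b))  # a scalar between 0 and n
```

Each of the n terms touches one state at Hamming distance 1 from the data point, matching the "fantasy states" interpretation in the text.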
18.5 Denoising Score Matching
In some cases we may wish to regularize score matching, by fitting a distribution

p_{\text{smoothed}}(x) = \int p_{\text{data}}(y) \, q(x \mid y) \, dy    (18.27)

rather than the true p_data. The distribution q(x | y) is a corruption process, usually one that forms x by adding a small amount of noise to y.
Denoising score matching is especially useful because in practice we usually do not have access to the true p_data but rather only an empirical distribution defined by samples from it. Any consistent estimator will, given enough capacity, make p_model into a set of Dirac distributions centered on the training points. Smoothing by q helps to reduce this problem, at the loss of the asymptotic consistency property described in Sec. 5.4.5. Kingma and LeCun (2010) introduced a procedure for performing regularized score matching with the smoothing distribution q being normally distributed noise.

Recall from Sec. 14.5.1 that several autoencoder training algorithms are equivalent to score matching or denoising score matching. These autoencoder training algorithms are therefore a way of overcoming the partition function problem.
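A small NumPy sketch of the corruption process in Eq. 18.27 (the Gaussian choices for p_data and q are assumptions made so that p_smoothed has a known closed form): sampling y from the data and then x from q(x | y) yields samples from the smoothed distribution without ever evaluating the integral.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for p_data; here N(0, 1), chosen purely for the example.
y = rng.normal(size=100_000)

# Gaussian corruption process q(x | y) = N(x; y, sigma^2). Ancestral
# sampling (y from p_data, then x from q) draws from p_smoothed
# of Eq. 18.27 directly.
sigma = 0.5
x = y + sigma * rng.normal(size=y.shape)

# For these Gaussian choices, p_smoothed is N(0, 1 + sigma^2).
print(x.var())  # close to 1.25
```

In the denoising setting, the model's score is fit to this smoothed distribution rather than to an empirical sum of Dirac spikes.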
18.6 Noise-Contrastive Estimation
Most techniques for estimating models with intractable partition functions do not provide an estimate of the partition function. SML and CD estimate only the gradient of the log partition function, rather than the partition function itself. Score matching and pseudolikelihood avoid computing quantities related to the partition function altogether.

Noise-contrastive estimation (NCE) (Gutmann and Hyvarinen, 2010) takes a different strategy. In this approach, the probability distribution estimated by the model is represented explicitly as

\log p_{\text{model}}(x) = \log \tilde{p}_{\text{model}}(x; \theta) + c,    (18.28)

where c is explicitly introduced as an approximation of -\log Z(\theta). Rather than estimating only θ, the noise contrastive estimation procedure treats c as just another parameter and estimates θ and c simultaneously, using the same algorithm for both.
The resulting log p_model(x) thus may not correspond exactly to a valid probability distribution, but will become closer and closer to being valid as the estimate of c improves.¹

Such an approach would not be possible using maximum likelihood as the criterion for the estimator. The maximum likelihood criterion would choose to set c arbitrarily high, rather than setting c to create a valid probability distribution.

¹ NCE is also applicable to problems with a tractable partition function, where there is no need to introduce the extra parameter c. However, it has generated the most interest as a means of estimating models with difficult partition functions.
NCE works by reducing the unsupervised learning problem of estimating p(x) to that of learning a probabilistic binary classifier in which one of the categories corresponds to the data generated by the model. This supervised learning problem is constructed in such a way that maximum likelihood estimation in this supervised learning problem defines an asymptotically consistent estimator of the original problem.

Specifically, we introduce a second distribution, the noise distribution p_noise(x). The noise distribution should be tractable to evaluate and to sample from. We can now construct a model over both x and a new, binary class variable y. In the new joint model, we specify that

p_{\text{joint}}(y = 1) = \frac{1}{2},    (18.29)
p_{\text{joint}}(x \mid y = 1) = p_{\text{model}}(x),    (18.30)

and

p_{\text{joint}}(x \mid y = 0) = p_{\text{noise}}(x).    (18.31)

In other words, y is a switch variable that determines whether we will generate x from the model or from the noise distribution.

We can construct a similar joint model of training data. In this case, the switch variable determines whether we draw x from the data or from the noise distribution. Formally, p_train(y = 1) = \frac{1}{2}, p_train(x | y = 1) = p_data(x), and p_train(x | y = 0) = p_noise(x).

We can now just use standard maximum likelihood learning on the supervised learning problem of fitting p_joint to p_train:

\theta, c = \arg\max_{\theta, c} \mathbb{E}_{x, y \sim p_{\text{train}}} \log p_{\text{joint}}(y \mid x).    (18.32)

The distribution p_joint is essentially a logistic regression model applied to the difference in log probabilities of the model and the noise distribution:

p_{\text{joint}}(y = 1 \mid x) = \frac{p_{\text{model}}(x)}{p_{\text{model}}(x) + p_{\text{noise}}(x)}    (18.33)

= \frac{1}{1 + \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)}}    (18.34)

= \frac{1}{1 + \exp\left( \log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right)}    (18.35)
= \sigma\left( -\log \frac{p_{\text{noise}}(x)}{p_{\text{model}}(x)} \right)    (18.36)

= \sigma\left( \log p_{\text{model}}(x) - \log p_{\text{noise}}(x) \right).    (18.37)

NCE is thus simple to apply so long as log p̃ is easy to back-propagate through, and, as specified above, p_noise is easy to evaluate (in order to evaluate p_joint) and sample from (in order to generate the training data).

NCE is most successful when applied to problems with few random variables, but it can work well even if those random variables can take on a high number of values. For example, it has been successfully applied to modeling the conditional distribution over a word given the context of the word (Mnih and Kavukcuoglu, 2013). Though the word may be drawn from a large vocabulary, there is only one word.

When NCE is applied to problems with many random variables, it becomes less efficient. The logistic regression classifier can reject a noise sample by identifying any one variable whose value is unlikely. This means that learning slows down
greatly after p_model has learned the basic marginal statistics. Imagine learning a model of images of faces, using unstructured Gaussian noise as p_noise. If p_model learns about eyes, it can reject almost all unstructured noise samples without having learned anything about other facial features, such as mouths.

The constraint that p_noise must be easy to evaluate and easy to sample from can be overly restrictive. When p_noise is simple, most samples are likely to be too obviously distinct from the data to force p_model to improve noticeably.

Like score matching and pseudolikelihood, NCE does not work if only a lower bound on p̃ is available.
Such a lower bound could be used to construct a lower bound on p_joint(y = 1 | x), but it can only be used to construct an upper bound on p_joint(y = 0 | x), which appears in half the terms of the NCE objective. Likewise, a lower bound on p_noise is not useful, because it provides only an upper bound on p_joint(y = 1 | x).

When the model distribution is copied to define a new noise distribution before each gradient step, NCE defines a procedure called self-contrastive estimation, whose expected gradient is equivalent to the expected gradient of maximum likelihood (Goodfellow, 2014). The special case of NCE where the noise samples are those generated by the model suggests that maximum likelihood can be interpreted as a procedure that forces a model to constantly learn to distinguish reality from its own evolving beliefs, while noise contrastive estimation achieves
some reduced computational cost by only forcing the model to distinguish reality from a fixed baseline (the noise model).
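To make the NCE mechanics concrete, here is a minimal NumPy sketch (an illustration, not the book's procedure): the model's unnormalized density is held fixed and only the parameter c of Eq. 18.28 is learned, by gradient ascent on the supervised objective of Eq. 18.32 using the classifier of Eq. 18.37. The noise distribution, sample sizes and learning rate are all assumptions chosen for the example; at the optimum, c should approach -log Z.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fixed unnormalized model: p~(x) = exp(-x^2 / 2), so the true value of
# c is -log Z = -log sqrt(2*pi). Only c is learned here to keep the
# sketch short; in full NCE, theta is updated by the same loss.
log_p_tilde = lambda x: -0.5 * x ** 2
true_c = -0.5 * np.log(2.0 * np.pi)

# Noise distribution: N(0, 2^2), tractable to evaluate and to sample.
noise_std = 2.0
log_p_noise = lambda x: (-0.5 * (x / noise_std) ** 2
                         - np.log(noise_std * np.sqrt(2.0 * np.pi)))

x_data = rng.normal(size=20_000)               # examples with y = 1
x_noise = noise_std * rng.normal(size=20_000)  # examples with y = 0

c = 0.0
for _ in range(300):
    # Classifier of Eq. 18.37: sigma(log p_model - log p_noise),
    # with log p_model = log p~ + c (Eq. 18.28).
    a_data = log_p_tilde(x_data) + c - log_p_noise(x_data)
    a_noise = log_p_tilde(x_noise) + c - log_p_noise(x_noise)
    # Gradient ascent on the supervised log-likelihood of Eq. 18.32.
    grad_c = np.mean(1.0 - sigmoid(a_data)) - np.mean(sigmoid(a_noise))
    c += 0.5 * grad_c

print(c, true_c)  # the learned c approaches -log Z
```

The supervised objective is concave in c here, so plain gradient ascent converges; the learned c turns the unnormalized p̃ into an approximately normalized density, as the text describes.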
Using the supervised task of classifying between training samples and generated samples (with the model energy function used in defining the classifier) to provide a gradient on the model was introduced earlier in various forms (Welling et al., 2003b; Bengio, 2009).

Noise contrastive estimation is based on the idea that a good generative model should be able to distinguish data from noise. A closely related idea is that a good generative model should be able to generate samples that no classifier can distinguish from data. This idea yields generative adversarial networks (Sec. 20.10.4).
18.7 Estimating the Partition Function
While much of this chapter is dedicated to describing methods that avoid needing to compute the intractable partition function Z(θ) associated with an undirected graphical model, in this section we discuss several methods for directly estimating the partition function.

Estimating the partition function can be important because we require it if we wish to compute the normalized likelihood of data. This is often important in evaluating the model, monitoring training performance, and comparing models to each other.

For example, imagine we have two models: model M_A defining a probability distribution p_A(x; \theta_A) = \frac{1}{Z_A} \tilde{p}_A(x; \theta_A) and model M_B defining a probability distribution p_B(x; \theta_B) = \frac{1}{Z_B} \tilde{p}_B(x; \theta_B). A common way to compare the models is to evaluate and compare the likelihood that both models assign to an i.i.d. test dataset. Suppose the test set consists of m examples {x^{(1)}, \ldots, x^{(m)}}. If
\prod_i p_A(x^{(i)}; \theta_A) > \prod_i p_B(x^{(i)}; \theta_B),

or equivalently if

\sum_i \log p_A(x^{(i)}; \theta_A) - \sum_i \log p_B(x^{(i)}; \theta_B) > 0,    (18.38)

then we say that M_A is a better model than M_B (or, at least, it is a better model of the test set), in the sense that it has a better test log-likelihood. Unfortunately, testing whether this condition holds requires knowledge of the partition function: Eq. 18.38 seems to require evaluating the log probability that the model assigns to each point, which in turn requires evaluating the partition function. We can simplify the situation slightly by re-arranging Eq. 18.38 into a
form where we need to know only the ratio of the two models' partition functions:

\sum_i \log p_A(x^{(i)}; \theta_A) - \sum_i \log p_B(x^{(i)}; \theta_B) = \sum_i \log \frac{\tilde{p}_A(x^{(i)}; \theta_A)}{\tilde{p}_B(x^{(i)}; \theta_B)} - m \log \frac{Z(\theta_A)}{Z(\theta_B)}.    (18.39)

We can thus determine whether M_A is a better model than M_B without knowing the partition function of either model but only their ratio. As we will see shortly, we can estimate this ratio using importance sampling, provided that the two models are similar.

If, however, we wanted to compute the actual probability of the test data under either M_A or M_B, we would need to compute the actual value of the partition functions. That said, if we knew the ratio of two partition functions, r = \frac{Z(\theta_B)}{Z(\theta_A)}, and we knew the actual value of just one of the two, say Z(\theta_A), we could compute the value of the other:

Z(\theta_B) = r Z(\theta_A) = \frac{Z(\theta_B)}{Z(\theta_A)} Z(\theta_A).    (18.40)

A simple way to estimate the partition function is to use a Monte Carlo method such as simple importance sampling.
We present the approach in terms of continuous variables using integrals, but it can be readily applied to discrete variables by replacing the integrals with summation. We use a proposal distribution p_0(x) = \frac{1}{Z_0} \tilde{p}_0(x) which supports tractable sampling and tractable evaluation of both the partition function Z_0 and the unnormalized distribution \tilde{p}_0(x).

Z_1 = \int \tilde{p}_1(x) \, dx    (18.41)

= \int \frac{p_0(x)}{p_0(x)} \tilde{p}_1(x) \, dx    (18.42)

= Z_0 \int p_0(x) \frac{\tilde{p}_1(x)}{\tilde{p}_0(x)} \, dx    (18.43)

\hat{Z}_1 = \frac{Z_0}{K} \sum_{k=1}^{K} \frac{\tilde{p}_1(x^{(k)})}{\tilde{p}_0(x^{(k)})} \quad \text{s.t.} \; x^{(k)} \sim p_0    (18.44)

In the last line, we make a Monte Carlo estimator, \hat{Z}_1, of the integral using samples drawn from p_0(x) and then weight each sample with the ratio of the unnormalized \tilde{p}_1 and the proposal p_0.

We see also that this approach allows us to estimate the ratio between the
CHAPTER 18. CONFRONTING THE PARTITION FUNCTION
partition functions as

$$\frac{1}{K} \sum_{k=1}^{K} \frac{\tilde p_1(x^{(k)})}{\tilde p_0(x^{(k)})} \quad \text{s.t.} \; x^{(k)} \sim p_0. \tag{18.45}$$

This value can then be used directly to compare two models as described in Eq. 18.39.

If the distribution $p_0$ is close to $p_1$, Eq. 18.44 can be an effective way of estimating the partition function (Minka, 2005). Unfortunately, most of the time $p_1$ is both complicated (usually multimodal) and defined over a high dimensional space. It is difficult to find a tractable $p_0$ that is simple enough to evaluate while still being close enough to $p_1$ to result in a high quality approximation. If $p_0$ and $p_1$ are not close, most samples from $p_0$ will have low probability under $p_1$ and therefore make (relatively) negligible contribution to the sum in Eq. 18.44.

Having few samples with significant weights in this sum will result in an estimator that is of poor quality due to high variance. This can be understood quantitatively through an estimate of the variance of our estimate $\hat Z_1$:
$$\widehat{\mathrm{Var}}\left(\hat Z_1\right) = \frac{Z_0}{K^2} \sum_{k=1}^{K} \left( \frac{\tilde p_1(x^{(k)})}{\tilde p_0(x^{(k)})} - \hat Z_1 \right)^2 \tag{18.46}$$

This quantity is largest when there is significant deviation in the values of the importance weights $\frac{\tilde p_1(x^{(k)})}{\tilde p_0(x^{(k)})}$.

We now turn to two related strategies developed to cope with the challenging task of estimating partition functions for complex distributions over high-dimensional spaces: annealed importance sampling and bridge sampling. Both start with the simple importance sampling strategy introduced above, and both attempt to overcome the problem of the proposal $p_0$ being too far from $p_1$ by introducing intermediate distributions that attempt to bridge the gap between $p_0$ and $p_1$.
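The behavior described above can be illustrated numerically. The sketch below is a toy 1-D Gaussian example of our own (not from the text): it computes the estimator of Eq. 18.44 for a well-matched and for a badly mismatched proposal, along with a normalized effective-sample-size diagnostic (an illustrative helper, not from the text) that collapses toward zero when a few samples dominate the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: the "intractable" target p1_tilde is an unnormalized Gaussian
# with std 0.8, so the true partition function is Z1 = 0.8 * sqrt(2*pi).
# The proposal p0 is a unit-variance normal centered at `shift`, with
# known Z0 = sqrt(2*pi).
def p1_tilde(x):
    return np.exp(-0.5 * (x / 0.8) ** 2)

def estimate(shift, K=100_000):
    x = shift + rng.standard_normal(K)           # x^(k) ~ p0
    p0_tilde = np.exp(-0.5 * (x - shift) ** 2)   # unnormalized proposal
    w = p1_tilde(x) / p0_tilde                   # importance weights
    Z0 = np.sqrt(2.0 * np.pi)
    Z1_hat = (Z0 / K) * np.sum(w)                # Eq. 18.44
    # Fraction of samples carrying significant weight (in (0, 1]).
    ess = np.sum(w) ** 2 / (K * np.sum(w ** 2))
    return Z1_hat, ess

true_Z1 = 0.8 * np.sqrt(2.0 * np.pi)
good_Z1, good_ess = estimate(shift=0.0)   # proposal overlaps the target
bad_Z1, bad_ess = estimate(shift=6.0)     # proposal far from the target
```

With the overlapping proposal the estimate is accurate and nearly all samples contribute; with the shifted proposal the weights degenerate onto a handful of samples, which is exactly the high-variance regime that Eq. 18.46 quantifies.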
18.7.1 Annealed Importance Sampling
In situations where $D_{\mathrm{KL}}(p_0 \| p_1)$ is large (i.e., where there is little overlap between $p_0$ and $p_1$), a strategy called annealed importance sampling (AIS) attempts to bridge the gap by introducing intermediate distributions (Jarzynski, 1997; Neal, 2001). Consider a sequence of distributions $p_{\eta_0}, \ldots, p_{\eta_n}$, with $0 = \eta_0 < \eta_1 < \cdots < \eta_{n-1} < \eta_n = 1$, so that the first and last distributions in the sequence are $p_0$ and $p_1$ respectively.
This approach allows us to estimate the partition function of a multimodal distribution defined over a high-dimensional space (such as the distribution defined by a trained RBM). We begin with a simpler model with a known partition function (such as an RBM with zeroes for weights) and estimate the ratio between the two models' partition functions. The estimate of this ratio is based on the estimates of the ratios of a sequence of many similar distributions, such as the sequence of RBMs with weights interpolating between zero and the learned weights.

We can now write the ratio $\frac{Z_1}{Z_0}$ as
$$\frac{Z_1}{Z_0} = \frac{Z_1}{Z_0} \frac{Z_{\eta_1}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-1}}} \tag{18.47}$$
$$= \frac{Z_{\eta_1}}{Z_0} \frac{Z_{\eta_2}}{Z_{\eta_1}} \cdots \frac{Z_{\eta_{n-1}}}{Z_{\eta_{n-2}}} \frac{Z_1}{Z_{\eta_{n-1}}} \tag{18.48}$$
$$= \prod_{j=0}^{n-1} \frac{Z_{\eta_{j+1}}}{Z_{\eta_j}}. \tag{18.49}$$

Provided the distributions $p_{\eta_j}$ and $p_{\eta_{j+1}}$, for all $0 \le j \le n-1$, are sufficiently
close, we can reliably estimate each of the factors $\frac{Z_{\eta_{j+1}}}{Z_{\eta_j}}$ using simple importance sampling and then use these to obtain an estimate of $\frac{Z_1}{Z_0}$.

Where do these intermediate distributions come from? Just as the original proposal distribution $p_0$ is a design choice, so is the sequence of distributions $p_{\eta_1} \ldots p_{\eta_{n-1}}$. That is, it can be specifically constructed to suit the problem domain. One general-purpose and popular choice for the intermediate distributions is to use the weighted geometric average of the target distribution $p_1$ and the starting proposal distribution (for which the partition function is known) $p_0$:

$$p_{\eta_j} \propto p_1^{\eta_j} \, p_0^{1-\eta_j} \tag{18.50}$$
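To make Eq. 18.50 concrete, the sketch below interpolates between two hypothetical unnormalized 1-D Gaussians (toy endpoints of our own, not from the text). Because a geometric average of Gaussians is again Gaussian, the mode of $p_{\eta}$ slides smoothly from the mode of $p_0$ to the mode of $p_1$ as $\eta$ grows from 0 to 1.

```python
import numpy as np

# Hypothetical unnormalized endpoints: standard normal (mode 0) and a
# shifted normal (mode 3).
def p0_tilde(x):
    return np.exp(-0.5 * x ** 2)

def p1_tilde(x):
    return np.exp(-0.5 * (x - 3.0) ** 2)

def p_eta_tilde(x, eta):
    # Eq. 18.50: weighted geometric average of the two endpoints.
    return p1_tilde(x) ** eta * p0_tilde(x) ** (1.0 - eta)

# Locate the mode of p_eta on a fine grid for a few values of eta.
grid = np.linspace(-2.0, 5.0, 7001)
modes = [grid[np.argmax(p_eta_tilde(grid, eta))] for eta in (0.0, 0.5, 1.0)]
```

At $\eta = 0$ the mode sits at 0, at $\eta = 1$ at 3, and at $\eta = 0.5$ exactly halfway, illustrating how the sequence of intermediate distributions gradually morphs the proposal into the target.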
In order to sample from these intermediate distributions, we define a series of Markov chain transition functions $T_{\eta_j}(x' \mid x)$ that define the conditional probability distribution of transitioning to $x'$ given we are currently at $x$. The transition operator $T_{\eta_j}(x' \mid x)$ is defined to leave $p_{\eta_j}(x)$ invariant:

$$p_{\eta_j}(x) = \int p_{\eta_j}(x') \, T_{\eta_j}(x \mid x') \, dx' \tag{18.51}$$

These transitions may be constructed as any Markov chain Monte Carlo method (e.g., Metropolis-Hastings, Gibbs), including methods involving multiple passes through all of the random variables or other kinds of iterations.
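The invariance condition of Eq. 18.51 can be checked numerically in the discrete case, where the integral becomes a sum. The sketch below (a hypothetical two-state example of our own) builds a Metropolis-Hastings transition matrix for a target distribution and verifies that the target is left invariant.

```python
import numpy as np

# Hypothetical two-state target distribution.
p = np.array([0.3, 0.7])

# Metropolis-Hastings with a symmetric "flip state" proposal:
# accept a move i -> j with probability min(1, p[j] / p[i]).
T = np.zeros((2, 2))
for i in range(2):
    j = 1 - i
    accept = min(1.0, p[j] / p[i])
    T[i, j] = accept          # probability of moving to the other state
    T[i, i] = 1.0 - accept    # probability of staying put

# Discrete analogue of Eq. 18.51: p(x) = sum_{x'} p(x') T(x | x').
stationary = p @ T
```

Because Metropolis-Hastings satisfies detailed balance, `stationary` recovers `p` exactly, confirming that the constructed transition operator leaves the target distribution invariant.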
The AIS sampling strategy is then to generate samples from $p_0$ and then use the transition operators to sequentially generate samples from the intermediate distributions until we arrive at samples from the target distribution $p_1$:

• for $k = 1 \ldots K$
  – Sample $x_{\eta_1}^{(k)} \sim p_0(x)$
  – Sample $x_{\eta_2}^{(k)} \sim T_{\eta_1}(x_{\eta_2}^{(k)} \mid x_{\eta_1}^{(k)})$
  – $\ldots$
  – Sample $x_{\eta_{n-1}}^{(k)} \sim T_{\eta_{n-2}}(x_{\eta_{n-1}}^{(k)} \mid x_{\eta_{n-2}}^{(k)})$
  – Sample $x_{\eta_n}^{(k)} \sim T_{\eta_{n-1}}(x_{\eta_n}^{(k)} \mid x_{\eta_{n-1}}^{(k)})$
• end
For sample $k$, we can derive the importance weight by chaining together the importance weights for the jumps between the intermediate distributions given in Eq. 18.49:

$$w^{(k)} = \frac{\tilde p_{\eta_1}(x_{\eta_1}^{(k)})}{\tilde p_{\eta_0}(x_{\eta_1}^{(k)})} \, \frac{\tilde p_{\eta_2}(x_{\eta_2}^{(k)})}{\tilde p_{\eta_1}(x_{\eta_2}^{(k)})} \cdots \frac{\tilde p_{1}(x_{\eta_n}^{(k)})}{\tilde p_{\eta_{n-1}}(x_{\eta_n}^{(k)})}. \tag{18.52}$$

To avoid computational issues such as overflow, it is probably best to do the computation in log space:

$$\log w^{(k)} = \log \tilde p_{\eta_1}\left(x_{\eta_1}^{(k)}\right) - \log \tilde p_{\eta_0}\left(x_{\eta_1}^{(k)}\right) + \ldots \tag{18.53}$$

With the sampling procedure thus defined and the importance weights given in Eq. 18.52, the estimate of the ratio of partition functions is given by:

$$\frac{Z_1}{Z_0} \approx \frac{1}{K} \sum_{k=1}^{K} w^{(k)} \tag{18.54}$$

In order to verify that this procedure defines a valid importance sampling scheme, we can show (Neal, 2001) that the AIS procedure corresponds to simple importance sampling on an extended state space, with points sampled over the product space $[x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1]$. To do this, we define the distribution over the extended space as:
$$\tilde p(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) \tag{18.55}$$
$$= \tilde p_1(x_1) \, \tilde T_{\eta_{n-1}}(x_{\eta_{n-1}} \mid x_1) \, \tilde T_{\eta_{n-2}}(x_{\eta_{n-2}} \mid x_{\eta_{n-1}}) \ldots \tilde T_{\eta_1}(x_{\eta_1} \mid x_{\eta_2}), \tag{18.56}$$
where $\tilde T_a$ is the reverse of the transition operator defined by $T_a$ (via an application of Bayes' rule):

$$\tilde T_a(x' \mid x) = \frac{p_a(x')}{p_a(x)} T_a(x \mid x') = \frac{\tilde p_a(x')}{\tilde p_a(x)} T_a(x \mid x'). \tag{18.57}$$

Plugging the above into the expression for the joint distribution on the extended state space given in Eq. 18.56, we get:

$$\tilde p(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) \tag{18.58}$$
$$= \tilde p_1(x_1) \, \frac{\tilde p_{\eta_{n-1}}(x_{\eta_{n-1}})}{\tilde p_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_1 \mid x_{\eta_{n-1}}) \prod_{i=1}^{n-2} \frac{\tilde p_{\eta_i}(x_{\eta_i})}{\tilde p_{\eta_i}(x_{\eta_{i+1}})} T_{\eta_i}(x_{\eta_{i+1}} \mid x_{\eta_i}) \tag{18.59}$$
$$= \frac{\tilde p_1(x_1)}{\tilde p_{\eta_{n-1}}(x_1)} T_{\eta_{n-1}}(x_1 \mid x_{\eta_{n-1}}) \, \tilde p_{\eta_1}(x_{\eta_1}) \prod_{i=1}^{n-2} \frac{\tilde p_{\eta_{i+1}}(x_{\eta_{i+1}})}{\tilde p_{\eta_i}(x_{\eta_{i+1}})} T_{\eta_i}(x_{\eta_{i+1}} \mid x_{\eta_i}). \tag{18.60}$$
We now have a means of generating samples from the joint proposal distribution $q$ over the extended space via the sampling scheme given above, with the joint distribution given by:

$$q(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1) = p_0(x_{\eta_1}) \, T_{\eta_1}(x_{\eta_2} \mid x_{\eta_1}) \ldots T_{\eta_{n-1}}(x_1 \mid x_{\eta_{n-1}}). \tag{18.61}$$

We have a joint distribution on the extended space given by Eq. 18.60. Taking $q(x_{\eta_1}, \ldots, x_{\eta_{n-1}}, x_1)$ as the proposal distribution on the extended state space from which we will draw samples, it remains to determine the importance weights:

$$w^{(k)} = \frac{\tilde p(x_{\eta_1}^{(k)}, \ldots, x_{\eta_{n-1}}^{(k)}, x_1^{(k)})}{q(x_{\eta_1}^{(k)}, \ldots, x_{\eta_{n-1}}^{(k)}, x_1^{(k)})} = \frac{\tilde p_1(x_1^{(k)})}{\tilde p_{\eta_{n-1}}(x_1^{(k)})} \cdots \frac{\tilde p_{\eta_2}(x_{\eta_2}^{(k)})}{\tilde p_{\eta_1}(x_{\eta_2}^{(k)})} \, \frac{\tilde p_{\eta_1}(x_{\eta_1}^{(k)})}{\tilde p_0(x_{\eta_1}^{(k)})}. \tag{18.62}$$

These weights are the same as proposed for AIS. Thus we can interpret AIS as simple importance sampling applied to an extended state space, and its validity follows immediately from the validity of importance sampling.
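Putting the pieces together, the following sketch runs the full AIS procedure on a pair of hypothetical 1-D Gaussians (a toy example of our own, not from the text): geometric intermediate distributions as in Eq. 18.50, Metropolis-Hastings transitions that leave each $p_{\eta_j}$ invariant, log-space weight accumulation as in Eq. 18.53, and the final estimate of Eq. 18.54. The true $Z_1 = 2\sqrt{2\pi}$ is known here, so the estimate can be sanity-checked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy endpoints: p0 is a standard normal with known Z0 = sqrt(2*pi);
# p1_tilde is an unnormalized Gaussian with std 2, so Z1 = 2 * sqrt(2*pi).
def log_p0_tilde(x):
    return -0.5 * x ** 2

def log_p1_tilde(x):
    return -0.5 * (x / 2.0) ** 2

def log_p_eta(x, eta):
    # Eq. 18.50 in log space: weighted geometric average of the endpoints.
    return eta * log_p1_tilde(x) + (1.0 - eta) * log_p0_tilde(x)

def mh_step(x, eta, step=1.0):
    # One Metropolis-Hastings transition leaving p_eta invariant (Eq. 18.51).
    prop = x + step * rng.standard_normal(x.shape)
    accept = np.log(rng.random(x.shape)) < log_p_eta(prop, eta) - log_p_eta(x, eta)
    return np.where(accept, prop, x)

K, n = 2000, 100                      # parallel chains, intermediate steps
etas = np.linspace(0.0, 1.0, n + 1)
x = rng.standard_normal(K)            # x^(k) ~ p0
log_w = np.zeros(K)
for j in range(n):
    # Accumulate log importance weights (Eq. 18.53), then move each chain
    # with a transition targeting the next intermediate distribution.
    log_w += log_p_eta(x, etas[j + 1]) - log_p_eta(x, etas[j])
    x = mh_step(x, etas[j + 1])

Z0 = np.sqrt(2.0 * np.pi)
Z1_hat = Z0 * np.mean(np.exp(log_w))  # Eq. 18.54, scaled by the known Z0
```

Because the endpoints here overlap reasonably well and the schedule is fine, a modest number of chains already gives an estimate close to the analytic value; for an RBM the same loop structure applies, with the MH step replaced by Gibbs sampling on the model's variables.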
Annealed importance sampling (AIS) was first discovered by Jarzynski (1997) and then again, independently, by Neal (2001). It is currently the most common way of estimating the partition function for undirected probabilistic models. The reasons for this may have more to do with the publication of an influential paper (Salakhutdinov and Murray, 2008) describing its application to estimating the partition function of restricted Boltzmann machines and deep belief networks than
with any inherent advantage the method has over the other method described below. A discussion of the properties of the AIS estimator (e.g., its variance and efficiency) can be found in Neal (2001).

18.7.2 Bridge Sampling
Bridge sampling (Bennett, 1976) is another method that, like AIS, addresses the shortcomings of importance sampling. Rather than chaining together a series of intermediate distributions, bridge sampling relies on a single distribution $p_*$, known as the bridge, to interpolate between a distribution with known partition function, $p_0$, and a distribution $p_1$ for which we are trying to estimate the partition function $Z_1$.

Bridge sampling estimates the ratio $Z_1/Z_0$ as the ratio of the expected importance weights between $\tilde p_0$ and $\tilde p_*$ and between $\tilde p_1$ and $\tilde p_*$:

$$\frac{Z_1}{Z_0} \approx \sum_{k=1}^{K} \frac{\tilde p_*(x_0^{(k)})}{\tilde p_0(x_0^{(k)})} \Bigg/ \sum_{k=1}^{K} \frac{\tilde p_*(x_1^{(k)})}{\tilde p_1(x_1^{(k)})} \tag{18.63}$$

If the bridge distribution $p_*$ is chosen carefully to have a large overlap of support
with both $p_0$ and $p_1$, then bridge sampling can allow the distance between the two distributions (or, more formally, $D_{\mathrm{KL}}(p_0 \| p_1)$) to be much larger than with standard importance sampling.

It can be shown that the optimal bridging distribution is given by $p_*^{(\mathrm{opt})}(x) \propto \frac{\tilde p_0(x) \, \tilde p_1(x)}{r \, \tilde p_0(x) + \tilde p_1(x)}$, where $r = Z_1/Z_0$. At first, this appears to be an unworkable solution, as it would seem to require the very quantity we are trying to estimate, $Z_1/Z_0$. However, it is possible to start with a coarse estimate of $r$ and use the resulting bridge distribution to refine our estimate iteratively (Neal, 2005). That is, we iteratively re-estimate the ratio and use each iteration to update the value of $r$.

Linked importance sampling  Both AIS and bridge sampling have their advantages. If $D_{\mathrm{KL}}(p_0 \| p_1)$ is not too large (because $p_0$ and $p_1$ are sufficiently close), bridge sampling can be a more effective means of estimating the ratio of partition functions than AIS.
If, however, the two distributions are too far apart for a single distribution $p_*$ to bridge the gap, then one can at least use AIS, with its potentially many intermediate distributions, to span the distance between $p_0$ and $p_1$. Neal (2005) showed how his linked importance sampling method leveraged the power of the bridge sampling strategy to bridge the intermediate distributions used in AIS to significantly improve the overall partition function estimates.
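Returning to the basic estimator of Eq. 18.63: the sketch below applies it to hypothetical 1-D Gaussian endpoints (a toy example of our own, not from the text), where both $p_0$ and $p_1$ can be sampled exactly and the true ratio $Z_1/Z_0 = 1.5$ is known. For simplicity it uses a geometric bridge rather than the optimal bridge discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy endpoints: p0 = N(0, 1) and p1 = N(0, 1.5^2), both sampled exactly.
# With these unnormalized densities, Z0 = sqrt(2*pi) and Z1 = 1.5*sqrt(2*pi),
# so the true ratio Z1/Z0 is 1.5.
def p0_tilde(x):
    return np.exp(-0.5 * x ** 2)

def p1_tilde(x):
    return np.exp(-0.5 * (x / 1.5) ** 2)

def p_star_tilde(x):
    # Simple geometric bridge between the endpoints; it overlaps both,
    # which is the key requirement discussed in the text.
    return np.sqrt(p0_tilde(x) * p1_tilde(x))

K = 100_000
x0 = rng.standard_normal(K)            # x_0^(k) ~ p0
x1 = 1.5 * rng.standard_normal(K)      # x_1^(k) ~ p1

# Eq. 18.63: ratio of the two expected importance weights.
ratio_hat = np.mean(p_star_tilde(x0) / p0_tilde(x0)) \
          / np.mean(p_star_tilde(x1) / p1_tilde(x1))
```

Each average targets $Z_*/Z_0$ and $Z_*/Z_1$ respectively, so the unknown bridge normalizer $Z_*$ cancels in the ratio; only the overlap of $p_*$ with each endpoint matters for the variance.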
Estimating the partition function while training  While AIS has become accepted as the standard method for estimating the partition function for many undirected models, it is sufficiently computationally intensive that it remains infeasible to use during training. However, alternative strategies have been explored to maintain an estimate of the partition function throughout training.

Using a combination of bridge sampling, short-chain AIS and parallel tempering, Desjardins et al. (2011) devised a scheme to track the partition function of an RBM throughout the training process. The strategy is based on the maintenance of independent estimates of the partition functions of the RBM at every temperature operating in the parallel tempering scheme. The authors combined bridge sampling estimates of the ratios of partition functions of neighboring chains (i.e. from
from estimate of the partition functions at every iteration of learning. parallel tempering) with AIS estimates across time to come up with a low variance The to tools ols describ described ed in this chapter provide man many y different wa ways ys of overcoming estimate of the partition functions at every iteration of learning. the problem of intractable partition functions, but there can be several other The tools in this and chapter manmo y different ways ofamong overcoming difficulties in inv vdescrib olv olved ed inedtraining usingprovide generative models. dels. Foremost these the problem of intractable partition functions, but there can b e several other is the problem of intractable inference, which we confront next. difficulties involved in training and using generative models. Foremost among these is the problem of intractable inference, which we confront next.
Chapter 19
Approximate Inference

Many probabilistic models are difficult to train because it is difficult to perform inference in them. In the context of deep learning, we usually have a set of visible variables $v$ and a set of latent variables $h$. The challenge of inference usually refers to the difficult problem of computing $p(h \mid v)$ or taking expectations with respect to it. Such operations are often necessary for tasks like maximum likelihood learning.

Many simple graphical models with only one hidden layer, such as restricted Boltzmann machines and probabilistic PCA, are defined in a way that makes inference operations like computing $p(h \mid v)$, or taking expectations with respect to it, simple. Unfortunately, most graphical models with multiple layers of hidden variables have intractable posterior distributions. Exact inference requires an exponential amount of time in these models. Even some models with only a single layer, such as sparse coding, have this problem.

In this chapter, we introduce several of the techniques for confronting these intractable inference problems. Later, in Chapter 20, we will describe how to use these techniques to train probabilistic models that would otherwise be intractable, such as deep belief networks and deep Boltzmann machines.

Intractable inference problems in deep learning usually arise from interactions between latent variables in a structured graphical model. See Fig. 19.1 for some examples. These interactions may be due to direct interactions in undirected models or "explaining away" interactions between mutual ancestors of the same visible unit in directed models.
CHAPTER 19. APPROXIMATE INFERENCE
Figure 19.1: Intractable inference problems in deep learning are usually the result of interactions between latent variables in a structured graphical model. These can be due to edges directly connecting one latent variable to another, or due to longer paths that are activated when the child of a V-structure is observed. (Left) A semi-restricted Boltzmann machine (Osindero and Hinton, 2008) with connections between hidden units. These direct connections between latent variables make the posterior distribution intractable due to large cliques of latent variables. (Center) A deep Boltzmann machine, organized into layers of variables without intra-layer connections, still has an intractable posterior distribution due to the connections between layers. (Right) This directed model has interactions between latent variables when the visible variables are observed, because every two latent variables are co-parents. Some probabilistic models are able to provide tractable inference over the latent variables despite having one of the graph structures depicted above. This is possible if the conditional probability distributions are chosen to introduce additional independences beyond those described by the graph. For example, probabilistic PCA has the graph structure shown on the right, yet still has simple inference due to special properties of the specific conditional distributions it uses (linear-Gaussian conditionals with mutually orthogonal basis vectors).
19.1 Inference as Optimization
Many approaches to confronting the problem of difficult inference make use of the observation that exact inference can be described as an optimization problem. Approximate inference algorithms may then be derived by approximating the underlying optimization problem.

To construct the optimization problem, assume we have a probabilistic model consisting of observed variables $v$ and latent variables $h$. We would like to compute the log probability of the observed data, $\log p(v; \theta)$. Sometimes it is too difficult to compute $\log p(v; \theta)$ if it is costly to marginalize out $h$. Instead, we can compute a lower bound $\mathcal{L}(v, \theta, q)$ on $\log p(v; \theta)$. This bound is called the evidence lower bound (ELBO).
L(v, θ, q) = log p(v; θ) − D_KL(q(h | v) ‖ p(h | v; θ))        (19.1)

where q is an arbitrary probability distribution over h.

Because the difference between log p(v) and L(v, θ, q) is given by the KL divergence, and because the KL divergence is always non-negative, we can see that L always has at most the same value as the desired log probability. The two are equal if and only if q is the same distribution as p(h | v).

Surprisingly, L can be considerably easier to compute for some distributions q. Simple algebra shows that we can rearrange L into a much more convenient form:

L(v, θ, q) = log p(v; θ) − D_KL(q(h | v) ‖ p(h | v; θ))        (19.2)
           = log p(v; θ) − E_{h∼q} log [q(h | v) / p(h | v)]        (19.3)
           = log p(v; θ) − E_{h∼q} log [q(h | v) / (p(h, v; θ) / p(v; θ))]        (19.4)
           = log p(v; θ) − E_{h∼q} [log q(h | v) − log p(h, v; θ) + log p(v; θ)]        (19.5)
           = −E_{h∼q} [log q(h | v) − log p(h, v; θ)].        (19.6)

This yields the more canonical definition of the evidence lower bound,

L(v, θ, q) = E_{h∼q} [log p(h, v)] + H(q).        (19.7)

For an appropriate choice of q, L is tractable to compute. For any choice of q, L provides a lower bound on the likelihood. For q(h | v) that are better
approximations of p(h | v), the lower bound L will be tighter, in other words, closer to log p(v). When q(h | v) = p(h | v), the approximation is perfect, and L(v, θ, q) = log p(v; θ).

We can thus think of inference as the procedure for finding the q that maximizes L. Exact inference maximizes L perfectly by searching over a family of functions q that includes p(h | v). Throughout this chapter, we will show how to derive different forms of approximate inference by using approximate optimization to find q. We can make the optimization procedure less expensive but approximate by restricting the family of distributions q the optimization is allowed to search over or by using an imperfect optimization procedure that may not completely maximize L but merely increase it by a significant amount.

No matter what choice of q we use, L is a lower bound. We can get tighter or looser bounds that are cheaper or more expensive to compute depending on how we choose to approach this optimization problem. We can obtain a poorly matched q but reduce the computational cost by using an imperfect optimization procedure, or by using a perfect optimization procedure over a restricted family of q distributions.
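The relationship between the ELBO and log p(v; θ) can be checked numerically on a toy model. The sketch below evaluates Eq. 19.7 for a single binary latent variable, using an arbitrary made-up joint table `p_joint` standing in for p(h, v), and confirms that L never exceeds log p(v) and touches it exactly when q equals the posterior:

```python
import math

# Hypothetical toy joint p(h, v) over one binary latent h and one binary
# visible v; the numbers are arbitrary and chosen only for illustration.
p_joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def elbo(v, q1):
    """L(v, q) = E_{h~q}[log p(h, v)] + H(q), as in Eq. 19.7.
    q1 is the variational probability q(h = 1 | v)."""
    q = [1.0 - q1, q1]
    expected_log_joint = sum(q[h] * math.log(p_joint[(h, v)]) for h in (0, 1))
    entropy = -sum(qh * math.log(qh) for qh in q if qh > 0)
    return expected_log_joint + entropy

v = 1
log_pv = math.log(p_joint[(0, v)] + p_joint[(1, v)])   # exact log p(v)
posterior_h1 = p_joint[(1, v)] / math.exp(log_pv)      # exact p(h=1 | v)

# The ELBO never exceeds log p(v), and matches it at q = p(h | v).
for q1 in (0.1, 0.3, 0.5, 0.7, 0.9, posterior_h1):
    assert elbo(v, q1) <= log_pv + 1e-12
assert abs(elbo(v, posterior_h1) - log_pv) < 1e-12
```

Any other choice of q pays a gap equal to D_KL(q(h | v) ‖ p(h | v)), which is exactly what the approximate inference schemes in this chapter trade away for tractability.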
19.2 Expectation Maximization
The first algorithm we introduce based on maximizing a lower bound L is the expectation maximization (EM) algorithm, a popular training algorithm for models with latent variables. We describe here a view on the EM algorithm developed by Neal and Hinton (1999). Unlike most of the other algorithms we describe in this chapter, EM is not an approach to approximate inference, but rather an approach to learning with an approximate posterior.

The EM algorithm consists of alternating between two steps until convergence:

• The E-step (Expectation step): Let θ^(0) denote the value of the parameters at the beginning of the step. Set q(h^(i) | v) = p(h^(i) | v^(i); θ^(0)) for all indices i of the training examples v^(i) we want to train on (both batch and minibatch variants are valid). By this we mean q is defined in terms of the current parameter value of θ^(0); if we vary θ then p(h | v; θ) will change but q(h | v) will remain equal to p(h | v; θ^(0)).
• The M-step (Maximization step): Completely or partially maximize

  Σ_i L(v^(i), θ, q)        (19.8)
with respect to θ using your optimization algorithm of choice.

This can be viewed as a coordinate ascent algorithm to maximize L. On one step, we maximize L with respect to q, and on the other, we maximize L with respect to θ.

Stochastic gradient ascent on latent variable models can be seen as a special case of the EM algorithm where the M step consists of taking a single gradient step. Other variants of the EM algorithm can make much larger steps. For some model families, the M step can even be performed analytically, jumping all the way to the optimal solution for θ given the current q.

Even though the E-step involves exact inference, we can think of the EM algorithm as using approximate inference in some sense. Specifically, the M-step assumes that the same value of q can be used for all values of θ. This will introduce a gap between L and the true log p(v) as the M-step moves further and further away from the value θ^(0) used in the E-step.
Fortunately, the E-step reduces the gap to zero again as we enter the loop for the next time.

The EM algorithm contains a few different insights. First, there is the basic structure of the learning process, in which we update the model parameters to improve the likelihood of a completed dataset, where all missing variables have their values provided by an estimate of the posterior distribution. This particular insight is not unique to the EM algorithm. For example, using gradient descent to maximize the log-likelihood also has this same property; the log-likelihood gradient computations require taking expectations with respect to the posterior distribution over the hidden units. Another key insight in the EM algorithm is that we can continue to use one value of q even after we have moved to a different value of θ. This particular insight is used throughout classical machine learning to derive large
M-step updates. In the context of deep learning, most models are too complex to admit a tractable solution for an optimal large M-step update, so this second insight, which is more unique to the EM algorithm, is rarely used.
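As a concrete illustration, the alternation described above can be sketched for a two-component Gaussian mixture, where the E-step (exact posterior responsibilities) and the M-step (closed-form means) are both available. The data, the initialization, and the choice to hold the mixing weight and variance fixed are arbitrary simplifications for this sketch, not part of the text:

```python
import math

def em_two_gaussians(data, mu, n_iters=20, pi=0.5, var=1.0):
    """Minimal EM sketch for a mixture of two unit-variance Gaussians with
    unknown means mu = [mu0, mu1]; mixing weight and variance stay fixed.
    Returns the final means and the per-iteration log-likelihoods."""
    def logpdf(x, m):
        return -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)

    lls = []
    for _ in range(n_iters):
        # E-step: q(h | x) = p(h | x; theta^(0)), the exact posterior
        # responsibilities under the current parameter values.
        resp, ll = [], 0.0
        for x in data:
            w0 = pi * math.exp(logpdf(x, mu[0]))
            w1 = (1 - pi) * math.exp(logpdf(x, mu[1]))
            resp.append(w1 / (w0 + w1))
            ll += math.log(w0 + w1)
        lls.append(ll)
        # M-step: maximize sum_i L(x^(i), theta, q) over the means, which
        # here has a closed form (responsibility-weighted averages).
        r1 = sum(resp)
        r0 = len(data) - r1
        mu = [sum((1 - r) * x for r, x in zip(resp, data)) / r0,
              sum(r * x for r, x in zip(resp, data)) / r1]
    return mu, lls

data = [-2.1, -1.9, -2.3, 1.8, 2.2, 2.0]
mu, lls = em_two_gaussians(data, mu=[-1.0, 1.0])
# Each EM cycle can only raise (never lower) the log-likelihood.
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

The monotone log-likelihood trace is exactly the coordinate-ascent view: the E-step closes the KL gap in L, and the M-step raises L with q held fixed.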
19.3 MAP Inference and Sparse Coding
We usually use the term inference to refer to computing the probability distribution over one set of variables given another. When training probabilistic models with latent variables, we are usually interested in computing p(h | v). An alternative form of inference is to compute the single most likely value of the missing variables, rather than to infer the entire distribution over their possible values. In the context
of latent variable models, this means computing

h* = arg max_h p(h | v).        (19.9)
This is known as maximum a posteriori inference, abbreviated MAP inference.

MAP inference is usually not thought of as approximate inference; it does compute the exact most likely value of h*. However, if we wish to develop a learning process based on maximizing L(v, h, q), then it is helpful to think of MAP inference as a procedure that provides a value of q. In this sense, we can think of MAP inference as approximate inference, because it does not provide the optimal q.

Recall from Sec. 19.1 that exact inference consists of maximizing

L(v, θ, q) = E_{h∼q} [log p(h, v)] + H(q)        (19.10)

with respect to q over an unrestricted family of probability distributions, using an exact optimization algorithm. We can derive MAP inference as a form of approximate inference by restricting the family of distributions q may be drawn from. Specifically, we require q to take on a Dirac distribution:

q(h | v) = δ(h − µ).        (19.11)
This means that we can now control q entirely via µ. Dropping terms of L that do not vary with µ, we are left with the optimization problem

µ* = arg max_µ log p(h = µ, v),        (19.12)

which is equivalent to the MAP inference problem

h* = arg max_h p(h | v).        (19.13)
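The equivalence of Eq. 19.12 and Eq. 19.13 is just the observation that p(h | v) = p(h, v) / p(v) and p(v) does not depend on h. A tiny sketch with a hypothetical joint table (the numbers are made up for illustration) makes this concrete:

```python
# Hypothetical values of p(h, v) at one observed v, for a discrete h.
p_joint = {0: 0.1, 1: 0.25, 2: 0.15}
p_v = sum(p_joint.values())   # p(v), constant with respect to h

# Eq. 19.13: maximize the posterior p(h | v) = p(h, v) / p(v).
map_from_posterior = max(p_joint, key=lambda h: p_joint[h] / p_v)
# Eq. 19.12: maximize the joint p(h, v) directly (log is monotone).
map_from_joint = max(p_joint, key=lambda h: p_joint[h])

assert map_from_posterior == map_from_joint
```

Dividing by the constant p(v) (or taking a log) never changes which h attains the maximum, so the two problems select the same point.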
We can thus justify a learning procedure similar to EM, in which we alternate between performing MAP inference to infer h* and then updating θ to increase log p(h*, v). As with EM, this is a form of coordinate ascent on L, where we alternate between using inference to optimize L with respect to q and using parameter updates to optimize L with respect to θ. The procedure as a whole can be justified by the fact that L is a lower bound on log p(v). In the case of MAP inference, this justification is rather vacuous, because the bound is infinitely loose, due to the Dirac distribution's differential entropy of negative infinity. However, adding noise to µ would make the bound meaningful again.
MAP inference is commonly used in deep learning as both a feature extractor and a learning mechanism. It is primarily used for sparse coding models.

Recall from Sec. 13.4 that sparse coding is a linear factor model that imposes a sparsity-inducing prior on its hidden units. A common choice is a factorial Laplace prior, with

p(h_i) = (λ/4) e^{−(1/2) λ |h_i|}.        (19.14)

The visible units are then generated by performing a linear transformation and adding noise:

p(x | h) = N(x; W h + b, β^{−1} I).        (19.15)

Computing or even representing p(h | v) is difficult. Every pair of variables h_i and h_j are both parents of v. This means that when v is observed, the graphical model contains an active path connecting h_i and h_j. All of the hidden units thus participate in one massive clique in p(h | v). If the model were Gaussian then these interactions could be modeled efficiently via the covariance matrix, but the sparse prior makes these interactions non-Gaussian.
Because p(h | v) is intractable, so is the computation of the log-likelihood and its gradient. We thus cannot use exact maximum likelihood learning. Instead, we use MAP inference and learn the parameters by maximizing the ELBO defined by the Dirac distribution around the MAP estimate of h.

If we concatenate all of the h vectors in the training set into a matrix H, then the sparse coding learning process consists of minimizing

J(H, W) = Σ_{i,j} |H_{i,j}| + Σ_{i,j} (X − H W^⊤)²_{i,j}.        (19.16)

Most applications of sparse coding also involve weight decay or a constraint on the norms of the columns of W, in order to prevent the pathological solution with extremely small H and large W.

We can minimize J by alternating between minimization with respect to H and minimization with respect to W. Both sub-problems are convex. In fact, the minimization with respect to W is just a linear regression problem. However, minimization of J with respect to both arguments is usually not a convex problem.

Minimization with respect to H requires specialized algorithms such as the feature-sign search algorithm (Lee et al., 2007).
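A minimal sketch of this alternating scheme follows. The W-step is the closed-form linear regression noted above; for the H-step we substitute a few ISTA (proximal gradient) steps in place of feature-sign search, since ISTA is short and also decreases J monotonically for a small enough step size. Shapes, the random data, and iteration counts are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: X is (n_examples, n_visible), H holds one
# code per row, W is the (n_visible, n_code) dictionary.
n, d, k = 20, 8, 5
X = rng.normal(size=(n, d))
H = np.zeros((n, k))
W = rng.normal(size=(d, k))

def J(H, W):
    """Eq. 19.16: L1 penalty on the codes plus squared reconstruction error."""
    return np.abs(H).sum() + ((X - H @ W.T) ** 2).sum()

costs = [J(H, W)]
for _ in range(10):
    # H-step: a few ISTA steps as a stand-in for feature-sign search.
    # Step size 1 / Lipschitz constant of the squared-error gradient.
    t = 1.0 / (2 * np.linalg.norm(W, 2) ** 2 + 1e-8)
    for _ in range(50):
        G = 2 * (H @ W.T - X) @ W                         # gradient of the error term
        Z = H - t * G
        H = np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)   # soft threshold (prox of L1)
    # W-step: plain linear regression, solved in closed form. (A real
    # implementation would add weight decay or a column-norm constraint
    # to rule out the small-H / large-W pathology noted above.)
    W = np.linalg.lstsq(H, X, rcond=None)[0].T
    costs.append(J(H, W))

# Each alternating step can only lower the objective.
assert all(b <= a + 1e-8 for a, b in zip(costs, costs[1:]))
```

Each half-step solves (or at least improves) one of the two convex sub-problems with the other argument fixed, which is why the cost trace is non-increasing even though the joint problem is non-convex.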
19.4 Variational Inference and Learning
We have seen how the evidence lower bound L(v, θ, q) is a lower bound on log p(v; θ), how inference can be viewed as maximizing L with respect to q, and how learning can be viewed as maximizing L with respect to θ. We have seen that the EM algorithm allows us to make large learning steps with a fixed q and that learning algorithms based on MAP inference allow us to learn using a point estimate of p(h | v) rather than inferring the entire distribution. Now we develop the more general approach to variational learning.

The core idea behind variational learning is that we can maximize L over a restricted family of distributions q. This family should be chosen so that it is easy to compute E_q log p(h, v). A typical way to do this is to introduce assumptions about how q factorizes.

A common approach to variational learning is to impose the restriction that q is a factorial distribution:

q(h | v) = Π_i q(h_i | v).        (19.17)

This is called the mean field approach. More generally, we can impose any graphical model structure we choose on q, to flexibly determine how many interactions we want our approximation to capture. This fully general graphical model approach is called structured variational inference (Saul and Jordan, 1996).

The beauty of the variational approach is that we do not need to specify a specific parametric form for q. We specify how it should factorize, but then the optimization problem determines the optimal probability distribution within those factorization constraints. For discrete latent variables, this just means that we use traditional optimization techniques to optimize a finite number of variables describing the q distribution.
For continuous latent variables, this means that we use a branch of mathematics called calculus of variations to perform optimization over a space of functions, and actually determine which function should be used to represent q. Calculus of variations is the origin of the names "variational learning" and "variational inference," though these names apply even when the latent variables are discrete and calculus of variations is not needed. In the case of continuous latent variables, calculus of variations is a powerful technique that removes much of the responsibility from the human designer of the model, who now must specify only how q factorizes, rather than needing to guess how to design a specific q that can accurately approximate the posterior.

Because L(v, θ, q) is defined to be log p(v; θ) − D_KL(q(h | v) ‖ p(h | v; θ)), we can think of maximizing L with respect to q as minimizing D_KL(q(h | v) ‖ p(h | v)).
In this sense, we are fitting q to p. However, we are doing so with the opposite direction of the KL divergence than we are used to using for fitting an approximation. When we use maximum likelihood learning to fit a model to data, we minimize D_KL(p_data ‖ p_model). As illustrated in Fig. 3.6, this means that maximum likelihood encourages the model to have high probability everywhere that the data has high probability, while our optimization-based inference procedure encourages q to have low probability everywhere the true posterior has low probability. Both directions of the KL divergence can have desirable and undesirable properties. The choice of which to use depends on which properties are the highest priority for each application. In the case of the inference optimization problem, we choose to use D_KL(q(h | v) ‖ p(h | v)) for computational reasons. Specifically, computing
D_KL(q(h | v) ‖ p(h | v)) involves evaluating expectations with respect to q, so by designing q to be simple, we can simplify the required expectations. The opposite direction of the KL divergence would require computing expectations with respect to the true posterior. Because the form of the true posterior is determined by the choice of model, we cannot design a reduced-cost approach to computing D_KL(p(h | v) ‖ q(h | v)) exactly.

19.4.1 Discrete Latent Variables
Variational inference with discrete latent variables is relatively straightforward. We define a distribution q, typically one where each factor of q is just defined by a lookup table over discrete states. In the simplest case, h is binary and we make the mean field assumption that q factorizes over each individual h_i. In this case we can parametrize q with a vector ĥ whose entries are probabilities. Then q(h_i = 1 | v) = ĥ_i.

After determining how to represent q, we simply optimize its parameters. In the case of discrete latent variables, this is just a standard optimization problem. In principle the selection of q could be done with any optimization algorithm, such as gradient descent.

Because this optimization must occur in the inner loop of a learning algorithm, it must be very fast. To achieve this speed, we typically use special optimization
A popular choice is to iterate fixed point equations, in other words, to solve

\frac{\partial}{\partial \hat{h}_i} \mathcal{L} = 0 \quad (19.18)

for ĥ_i. We repeatedly update different elements of ĥ until we satisfy a convergence criterion.
CHAPTER 19. APPROXIMATE INFERENCE
To make this more concrete, we show how to apply variational inference to the binary sparse coding model (we present here the model developed by Henniges et al. (2010) but demonstrate traditional, generic mean field applied to the model, while they introduce a specialized algorithm). This derivation goes into considerable mathematical detail and is intended for the reader who wishes to fully resolve any ambiguity in the high-level conceptual description of variational inference and learning we have presented so far. Readers who do not plan to derive or implement variational learning algorithms may safely skip to the next section without missing any new high-level concepts. Readers who proceed with the binary sparse coding example are encouraged to review the list of useful properties of functions that commonly arise in probabilistic models in Sec. 3.10.
We use these properties liberally throughout the following derivations without highlighting exactly where we use each one.

In the binary sparse coding model, the input v ∈ R^n is generated from the model by adding Gaussian noise to the sum of m different components which can each be present or absent. Each component is switched on or off by the corresponding hidden unit in h ∈ {0, 1}^m:

p(h_i = 1) = \sigma(b_i) \quad (19.19)

p(v \mid h) = \mathcal{N}(v; Wh, \beta^{-1}) \quad (19.20)

where b is a learnable set of biases, W is a learnable weight matrix, and β is a learnable, diagonal precision matrix.

Training this model with maximum likelihood requires taking the derivative with respect to the parameters.
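Before differentiating, the generative process of Eqs. 19.19-19.20 can be made concrete with a short ancestral-sampling sketch. The function name, dimensions, and parameter values below are illustrative choices of ours, not from the text:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_binary_sparse_coding(b, W, beta, rng):
    """Ancestral sampling: h_i ~ Bernoulli(sigmoid(b_i)), then
    v_j ~ Normal(mean=(W h)_j, variance=1/beta_j), since beta holds the
    diagonal of the precision matrix."""
    m = len(b)   # number of hidden units
    n = len(W)   # number of visible units; W has shape n x m
    h = [1 if rng.random() < sigmoid(bi) else 0 for bi in b]
    v = [rng.gauss(sum(W[j][i] * h[i] for i in range(m)),
                   1.0 / math.sqrt(beta[j])) for j in range(n)]
    return h, v

# Toy parameters: m = 3 hidden units, n = 2 visible units.
b = [-1.0, 0.5, 0.0]
W = [[1.0, 0.2, -0.3],
     [0.0, 0.7, 0.4]]
beta = [4.0, 4.0]
h, v = sample_binary_sparse_coding(b, W, beta, random.Random(0))
```

Each hidden unit independently switches its component on or off, and the visible units receive the sum of the active components plus per-dimension Gaussian noise.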
Consider the derivative with respect to one of the biases:

\frac{\partial}{\partial b_i} \log p(v) \quad (19.21)

= \frac{\frac{\partial}{\partial b_i} p(v)}{p(v)} \quad (19.22)

= \frac{\frac{\partial}{\partial b_i} \sum_h p(h, v)}{p(v)} \quad (19.23)

= \frac{\frac{\partial}{\partial b_i} \sum_h p(h) p(v \mid h)}{p(v)} \quad (19.24)
Figure 19.2: The graph structure of a binary sparse coding model with four hidden units. (Left) The graph structure of p(h, v). Note that the edges are directed, and that every two hidden units are co-parents of every visible unit. (Right) The graph structure of p(h | v). In order to account for the active paths between co-parents, the posterior distribution needs an edge between all of the hidden units.
= \frac{\sum_h p(v \mid h) \frac{\partial}{\partial b_i} p(h)}{p(v)} \quad (19.25)

= \sum_h p(h \mid v) \frac{\frac{\partial}{\partial b_i} p(h)}{p(h)} \quad (19.26)

= \mathbb{E}_{h \sim p(h \mid v)} \frac{\partial}{\partial b_i} \log p(h). \quad (19.27)

This requires computing expectations with respect to p(h | v). Unfortunately, p(h | v) is a complicated distribution. See Fig. 19.2 for the graph structure of p(h, v) and p(h | v). The posterior distribution corresponds to the complete graph over the hidden units, so variable elimination algorithms do not help us to compute the required expectations any faster than brute force.

We can resolve this difficulty by using variational inference and variational learning instead.

We can make a mean field approximation:

q(h \mid v) = \prod_i q(h_i \mid v). \quad (19.28)

The latent variables of the binary sparse coding model are binary, so to represent a factorial q we simply need to model m Bernoulli distributions q(h_i | v). A natural way to represent the means of the Bernoulli distributions is with a vector ĥ of probabilities, with q(h_i = 1 | v) = ĥ_i. We impose a restriction that ĥ_i is never equal to 0 or to 1, in order to avoid errors when computing, for example, log ĥ_i. We will see that the variational inference equations never assign 0 or 1 to ĥ_i
analytically. However, in a software implementation, machine rounding error could result in 0 or 1 values. In software, we may wish to implement binary sparse coding using an unrestricted vector of variational parameters z and obtain ĥ via the relation ĥ = σ(z). We can thus safely compute log ĥ_i on a computer by using the identity log σ(z_i) = −ζ(−z_i) relating the sigmoid and the softplus.

To begin our derivation of variational learning in the binary sparse coding model, we show that the use of this mean field approximation makes learning tractable.

The evidence lower bound is given by

\mathcal{L}(v, \theta, q) \quad (19.29)

= \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q) \quad (19.30)

= \mathbb{E}_{h \sim q}[\log p(h) + \log p(v \mid h) - \log q(h \mid v)] \quad (19.31)

= \mathbb{E}_{h \sim q}\left[ \sum_{i=1}^m \log p(h_i) + \sum_{i=1}^n \log p(v_i \mid h) - \sum_{i=1}^m \log q(h_i \mid v) \right] \quad (19.32)

= \sum_{i=1}^m \left[ \hat{h}_i \left( \log \sigma(b_i) - \log \hat{h}_i \right) + (1 - \hat{h}_i)\left( \log \sigma(-b_i) - \log(1 - \hat{h}_i) \right) \right] \quad (19.33)

+ \mathbb{E}_{h \sim q}\left[ \sum_{i=1}^n \log \sqrt{\frac{\beta_i}{2\pi}} \exp\left( -\frac{\beta_i}{2} \left( v_i - W_{i,:} h \right)^2 \right) \right] \quad (19.34)

= \sum_{i=1}^m \left[ \hat{h}_i \left( \log \sigma(b_i) - \log \hat{h}_i \right) + (1 - \hat{h}_i)\left( \log \sigma(-b_i) - \log(1 - \hat{h}_i) \right) \right] \quad (19.35)

+ \sum_{i=1}^n \left[ \frac{1}{2} \log \frac{\beta_i}{2\pi} - \frac{\beta_i}{2} \left( v_i^2 - 2 v_i W_{i,:} \hat{h} + \sum_j \left[ W_{i,j}^2 \hat{h}_j + \sum_{k \neq j} W_{i,j} W_{i,k} \hat{h}_j \hat{h}_k \right] \right) \right] \quad (19.36)

While these equations are somewhat unappealing aesthetically, they show that L can be expressed in a small number of simple arithmetic operations. The evidence lower bound L is therefore tractable. We can use L as a replacement for the intractable log-likelihood.

In principle, we could simply run gradient ascent on both θ and ĥ, and this would make a perfectly acceptable combined inference and training algorithm. Usually, however, we do not do this, for two reasons. First, this would require storing ĥ for each v. We typically prefer algorithms that do not require per-example memory. It is difficult to scale learning algorithms to billions of examples if we must remember a dynamically updated vector associated with each example.
Second, we would like to be able to extract the features ĥ very quickly, in order to recognize the content of v. In a realistic deployed setting, we would need to be able to compute ĥ in real time.

For both these reasons, we typically do not use gradient descent to compute the mean field parameters ĥ. Instead, we rapidly estimate them with fixed point equations.

The idea behind fixed point equations is that we are seeking a local maximum with respect to ĥ, where ∇_ĥ L(v, θ, ĥ) = 0. We cannot efficiently solve this equation with respect to all of ĥ simultaneously. However, we can solve for a single variable:

\frac{\partial}{\partial \hat{h}_i} \mathcal{L}(v, \theta, \hat{h}) = 0. \quad (19.37)

We can then iteratively apply the solution to the equation for i = 1, …, m, and repeat the cycle until we satisfy a convergence criterion.
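This cycle can be sketched as a generic coordinate loop. Here `solve_unit` is a hypothetical stand-in for the model-specific solution of Eq. 19.37 for a single unit given the current values of the others, and the stopping rule is a simple tolerance on how much a full cycle changes ĥ:

```python
def mean_field_fixed_point(h_hat, solve_unit, tol=1e-6, max_cycles=1000):
    """Repeatedly update each element of h_hat in turn.
    solve_unit(i, h_hat) is a hypothetical model-specific function
    returning the value of h_hat[i] that solves Eq. 19.37 given the
    current values of the other units. Stop once a full cycle no longer
    changes h_hat by more than tol."""
    for _ in range(max_cycles):
        biggest_change = 0.0
        for i in range(len(h_hat)):
            new_value = solve_unit(i, h_hat)
            biggest_change = max(biggest_change, abs(new_value - h_hat[i]))
            h_hat[i] = new_value
        if biggest_change < tol:
            break
    return h_hat

# Toy usage with a contraction whose fixed point is h_hat[i] = 0.3:
result = mean_field_fixed_point([0.9, 0.1], lambda i, h: 0.5 * h[i] + 0.15)
```

The toy `solve_unit` is chosen only so that the loop has an obvious fixed point; a real instance would come from a model-specific derivation.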
Common convergence criteria include stopping when a full cycle of updates does not improve L by more than some tolerance amount, or when the cycle does not change ĥ by more than some amount.

Iterating mean field fixed point equations is a general technique that can provide fast variational inference in a broad variety of models. To make this more concrete, we show how to derive the updates for the binary sparse coding model in particular.

First, we must write an expression for the derivatives with respect to ĥ_i. To do so, we substitute Eq. 19.36 into the left side of Eq. 19.37:

\frac{\partial}{\partial \hat{h}_i} \mathcal{L}(v, \theta, \hat{h}) \quad (19.38)

= \frac{\partial}{\partial \hat{h}_i} \Bigg[ \sum_{j=1}^m \left[ \hat{h}_j \left( \log \sigma(b_j) - \log \hat{h}_j \right) + (1 - \hat{h}_j)\left( \log \sigma(-b_j) - \log(1 - \hat{h}_j) \right) \right] \quad (19.39)

+ \sum_{j=1}^n \left[ \frac{1}{2} \log \frac{\beta_j}{2\pi} - \frac{\beta_j}{2} \left( v_j^2 - 2 v_j W_{j,:} \hat{h} + \sum_k \left[ W_{j,k}^2 \hat{h}_k + \sum_{l \neq k} W_{j,k} W_{j,l} \hat{h}_k \hat{h}_l \right] \right) \right] \Bigg] \quad (19.40)

= \log \sigma(b_i) - \log \hat{h}_i - 1 + \log(1 - \hat{h}_i) + 1 - \log \sigma(-b_i) \quad (19.41)

+ \sum_{j=1}^n \beta_j \left[ v_j W_{j,i} - \frac{1}{2} W_{j,i}^2 - \sum_{k \neq i} W_{j,k} W_{j,i} \hat{h}_k \right] \quad (19.42)
= b_i - \log \hat{h}_i + \log(1 - \hat{h}_i) + v^\top \beta W_{:,i} - \frac{1}{2} W_{:,i}^\top \beta W_{:,i} - \sum_{j \neq i} W_{:,j}^\top \beta W_{:,i} \hat{h}_j. \quad (19.43)

To apply the fixed point update inference rule, we solve for the ĥ_i that sets Eq. 19.43 to 0:

\hat{h}_i = \sigma\left( b_i + v^\top \beta W_{:,i} - \frac{1}{2} W_{:,i}^\top \beta W_{:,i} - \sum_{j \neq i} W_{:,j}^\top \beta W_{:,i} \hat{h}_j \right). \quad (19.44)

At this point, we can see that there is a close connection between recurrent neural networks and inference in graphical models. Specifically, the mean field fixed point equations defined a recurrent neural network. The task of this network is to perform inference. We have described how to derive this network from a model description, but it is also possible to train the inference network directly. Several ideas based on this theme are described in Chapter 20.

In the case of binary sparse coding, we can see that the recurrent network connection specified by Eq. 19.44 consists of repeatedly updating the hidden units based on the changing values of the neighboring hidden units. The input always sends a fixed message of v^⊤βW to the hidden units, but the hidden units constantly update the message they send to each other. Specifically, two units ĥ_i and ĥ_j inhibit each other when their weight vectors are aligned. This is a form of competition: between two hidden units that both explain the input, only the one that explains the input best will be allowed to remain active. This competition is the mean field approximation's attempt to capture the explaining away interactions in the binary sparse coding posterior. The explaining away effect actually should cause a multi-modal posterior, so that if we draw samples from the posterior, some samples will have one unit active, other samples will have the other unit active, but very few samples have both active. Unfortunately, explaining away interactions cannot be modeled by the factorial q used for mean field, so the mean field approximation is forced to choose one mode to model. This is an instance of the behavior illustrated in Fig. 3.6.

We can rewrite Eq. 19.44 into an equivalent form that reveals some further insights:

\hat{h}_i = \sigma\left( b_i + \left( v - \sum_{j \neq i} W_{:,j} \hat{h}_j \right)^\top \beta W_{:,i} - \frac{1}{2} W_{:,i}^\top \beta W_{:,i} \right). \quad (19.45)

In this reformulation, we see the input at each step as consisting of v − Σ_{j≠i} W_{:,j} ĥ_j rather than v. We can thus think of unit i as attempting to encode the residual error in v given the code of the other units. We can thus think of sparse coding as an iterative autoencoder, that repeatedly encodes and decodes its input, attempting to fix mistakes in the reconstruction after each iteration.

In this example, we have derived an update rule that updates a single unit at a time. It would be advantageous to be able to update more units simultaneously. Some graphical models, such as deep Boltzmann machines, are structured in such a way that we can solve for many entries of ĥ simultaneously. Unfortunately, binary sparse coding does not admit such block updates. Instead, we can use a heuristic technique called damping to perform block updates. In the damping approach, we solve for the individually optimal values of every element of ĥ, then move all of the values in a small step in that direction.
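As a sketch of how damping interacts with Eq. 19.44, the following applies the fixed point equation to every unit and then moves each ĥ_i only a fraction `alpha` of the way toward its individually optimal value; `alpha = 1` would recover the undamped synchronous update. All parameter values here are toy choices of ours:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def damped_mean_field_step(h_hat, v, b, W, beta, alpha=0.2):
    """One damped block update. W has shape n x m; beta holds the diagonal
    of the precision matrix. Each target value follows Eq. 19.44; damping
    then takes only a small step toward it."""
    n, m = len(W), len(b)
    target = []
    for i in range(m):
        # v^T beta W_{:,i}
        drive = sum(v[j] * beta[j] * W[j][i] for j in range(n))
        # (1/2) W_{:,i}^T beta W_{:,i}
        self_term = 0.5 * sum(beta[j] * W[j][i] ** 2 for j in range(n))
        # sum_{k != i} W_{:,k}^T beta W_{:,i} h_hat_k
        cross = sum(h_hat[k] * sum(beta[j] * W[j][k] * W[j][i] for j in range(n))
                    for k in range(m) if k != i)
        target.append(sigmoid(b[i] + drive - self_term - cross))
    # Damped step: convex combination of the old values and the targets.
    return [(1 - alpha) * h + alpha * t for h, t in zip(h_hat, target)]

# Toy problem: iterate the damped block update toward a fixed point.
b, beta, v = [0.0, 0.0], [1.0, 1.0], [1.0, 0.5]
W = [[1.0, 0.5],
     [0.2, 1.0]]
h_hat = [0.5, 0.5]
for _ in range(300):
    h_hat = damped_mean_field_step(h_hat, v, b, W, beta)
```

Because every ĥ_i stays a convex combination of sigmoid outputs, the iterates remain strictly inside (0, 1), consistent with the restriction imposed earlier on ĥ.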
This approach is no longer guaranteed to increase L at each step, but works well in practice for many models. See Koller and Friedman (2009) for more information about choosing the degree of synchrony and damping strategies in message passing algorithms.
19.4.2 Calculus of Variations
Before continuing with our presentation of variational learning, we must briefly introduce an important set of mathematical tools used in variational learning: calculus of variations.

Many machine learning techniques are based on minimizing a function J(θ) by finding the input vector θ ∈ R^n for which it takes on its minimal value. This can be accomplished with multivariate calculus and linear algebra, by solving for the critical points where ∇_θ J(θ) = 0. In some cases, we actually want to solve for a function f(x), such as when we want to find the probability density function over some random variable. This is what calculus of variations enables us to do.

A function of a function f is known as a functional J[f]. Much as we can take partial derivatives of a function with respect to elements of its vector-valued argument, we can take functional derivatives, also known as variational derivatives, of a functional J[f] with respect to individual values of the function f(x) at any specific value of x. The functional derivative of the functional J with respect to the value of the function f at point x is denoted \frac{\delta}{\delta f(x)} J.

A complete formal development of functional derivatives is beyond the scope of this book. For our purposes, it is sufficient to state that for differentiable functions f(x) and differentiable functions g(y, x) with continuous derivatives, that

\frac{\delta}{\delta f(x)} \int g(f(x), x) \, dx = \frac{\partial}{\partial y} g(f(x), x). \quad (19.46)
To gain some intuition for this identity, one can think of f(x) as being a vector with uncountably many elements, indexed by a real vector x. In this (somewhat incomplete) view, the identity providing the functional derivatives is the same as we would obtain for a vector θ ∈ R^n indexed by positive integers:

\frac{\partial}{\partial \theta_i} \sum_j g(\theta_j, j) = \frac{\partial}{\partial \theta_i} g(\theta_i, i). \quad (19.47)

Many results in other machine learning publications are presented using the more general Euler-Lagrange equation, which allows g to depend on the derivatives of f as well as the value of f, but we do not need this fully general form for the results presented in this book.

To optimize a function with respect to a vector, we take the gradient of the function with respect to the vector and solve for the point where every element of the gradient is equal to zero.
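The "function as an uncountable vector" intuition can be checked numerically: discretizing a functional such as H[p] = −∫ p(x) log p(x) dx on a grid turns it into an ordinary function of the grid values, and by Eq. 19.46 with g(y, x) = −y log y, its partial derivative with respect to one grid value should match the functional derivative −(log p(x) + 1) scaled by the grid spacing. The discretization below is our own illustration, not a construction from the text:

```python
import math

def entropy(p, dx):
    """Discretized H[p] = -sum_k p_k log p_k * dx."""
    return -sum(pk * math.log(pk) for pk in p) * dx

dx = 0.01
p = [0.5 + 0.1 * math.sin(k * dx) for k in range(100)]  # any positive grid values

# Compare a finite-difference partial derivative with the functional
# derivative delta H / delta p(x) = -(log p(x) + 1), scaled by dx.
k, eps = 37, 1e-6
p_bumped = list(p)
p_bumped[k] += eps
numeric = (entropy(p_bumped, dx) - entropy(p, dx)) / eps
analytic = -(math.log(p[k]) + 1.0) * dx
```

The two quantities agree up to finite-difference error, which is exactly the sense in which a functional derivative is the continuum limit of an ordinary gradient.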
Likewise, we can optimize a functional by solving for the function where the functional derivative at every point is equal to zero.

As an example of how this process works, consider the problem of finding the probability distribution function over x ∈ R that has maximal differential entropy. Recall that the entropy of a probability distribution p(x) is defined as

H[p] = -\mathbb{E}_x \log p(x). \quad (19.48)

For continuous values, the expectation is an integral:

H[p] = -\int p(x) \log p(x) \, dx. \quad (19.49)

We cannot simply maximize H[p] with respect to the function p(x), because the result might not be a probability distribution. Instead, we need to use Lagrange multipliers, to add a constraint that p(x) integrates to 1. Also, the entropy increases without bound as the variance increases. This makes the question of which distribution has the greatest entropy uninteresting.
Instead, we ask which distribution has maximal entropy for fixed variance σ². Finally, the problem is underdetermined because the distribution can be shifted arbitrarily without changing the entropy. To impose a unique solution, we add a constraint that the mean of the distribution be µ. The Lagrangian functional for this optimization problem is

\mathcal{L}[p] = \lambda_1 \left( \int p(x) \, dx - 1 \right) + \lambda_2 \left( \mathbb{E}[x] - \mu \right) + \lambda_3 \left( \mathbb{E}[(x - \mu)^2] - \sigma^2 \right) + H[p] \quad (19.50)
= \int \left( \lambda_1 p(x) + \lambda_2 p(x) x + \lambda_3 p(x) (x - \mu)^2 - p(x) \log p(x) \right) dx - \lambda_1 - \mu \lambda_2 - \sigma^2 \lambda_3. \quad (19.51)

To minimize the Lagrangian with respect to p, we set the functional derivatives equal to 0:

\forall x, \ \frac{\delta}{\delta p(x)} \mathcal{L} = \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 - \log p(x) = 0. \quad (19.52)

This condition now tells us the functional form of p(x). By algebraically re-arranging the equation, we obtain
p(x) = \exp\left( \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 - 1 \right). \quad (19.53)

We never assumed directly that p(x) would take this functional form; we obtained the expression itself by analytically minimizing a functional. To finish the minimization problem, we must choose the λ values to ensure that all of our constraints are satisfied. We are free to choose any λ values, because the gradient of the Lagrangian with respect to the λ variables is zero so long as the constraints are satisfied. To satisfy all of the constraints, we may set λ₁ = 1 − log σ√(2π), λ₂ = 0, and λ₃ = −1/(2σ²) to obtain

p(x) = \mathcal{N}(x; \mu, \sigma^2). \quad (19.54)

This is one reason for using the normal distribution when we do not know the true distribution. Because the normal distribution has the maximum entropy, we impose the least possible amount of structure by making this assumption.
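The choice of multipliers can be sanity-checked numerically: substituting λ₁ = 1 − log σ√(2π), λ₂ = 0, and λ₃ = −1/(2σ²) into Eq. 19.53 should reproduce the normal density pointwise and integrate to 1. The grid limits and step size below are arbitrary choices of ours:

```python
import math

mu, sigma = 1.5, 0.8
lam1 = 1.0 - math.log(sigma * math.sqrt(2.0 * math.pi))
lam2 = 0.0
lam3 = -1.0 / (2.0 * sigma ** 2)

def p(x):
    # Eq. 19.53 with the Lagrange multipliers substituted in.
    return math.exp(lam1 + lam2 * x + lam3 * (x - mu) ** 2 - 1.0)

def normal_pdf(x):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

# p should agree with the normal density pointwise...
max_gap = max(abs(p(x / 10.0) - normal_pdf(x / 10.0)) for x in range(-100, 100))

# ...and integrate to roughly 1 over a wide grid (simple Riemann sum).
step = 0.001
total = sum(p(mu - 6.0 * sigma + k * step) for k in range(int(12.0 * sigma / step))) * step
```

Both checks confirm that the stationary point of the Lagrangian, with these multiplier values, is exactly the Gaussian with the prescribed mean and variance.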
What ab about out the probability distribution function that minimizesy, w e found only oney critical poinfind t, corresp onding to maximizing the entrop for the entrop entropy? y? Wh Why did we not a second critical point corresp corresponding onding toy the fixed vum? ariance. thethere probability distribution function that minimizes minim minimum? TheWhat reasonabisout that is no sp specific ecific function that achiev achieves es minimal the entrop y? Wh y did we not find a second critical p oint corresp onding µ +the σ en entrop trop tropy y. As functions place more probability density on the tw two o points x = to minim um? The reason is that there is no sp ecific function that achiev es minimal and x = µ − σ, and place less probability density on all other values of x, they lose µ+σ entrop yy. while As functions place the moredesired probability density on the o pfunction oints x =placing en entrop trop tropy maintaining variance. Ho Howev wev wever, er, tw any x, they µ σmass and x = zero , andon place onin alltegrate other vto alues losea exactly all less but probability two poin oints ts density do does es not integrate one,ofand is not trop y while theThere desired Howev er, any function placing − maintaining ven alid probability distribution. thusvariance. is no single minimal entrop entropy y probabilit probability y exactly zero mass on all but t w o p oin ts do es not in tegrate to one, and is not a distribution function, muc much h as there is no single minimal positive real num number. ber. vInstead, alid probability distribution. thus is of noprobability single minimal entropy probabilit y we can say that there There is a sequence distributions con conv verging distribution function, mucon h as there is single minimal positive real num to tow ward putting mass only these tw two o pno oints. This degenerate scenario mayber. 
be Instead, we can say that there is a sequence of probability distributions con v erging describ described ed as a mixture of Dirac distributions. Because Dirac distributions are to w ard putting onlyprobabilit on theseytw o points. This degenerate scenario may bofe not describ described ed bymass a single probability distribution function, no Dirac or mixture described as a mixture of Dirac distributions. Because Dirac distributions are 650 not described by a single probability distribution function, no Dirac or mixture of
CHAPTER 19. APPROXIMATE INFERENCE
Dirac distributions corresponds to a single specific point in function space. These distributions are thus invisible to our method of solving for a specific point where the functional derivatives are zero. This is a limitation of the method. Distributions such as the Dirac must be found by other methods, such as guessing the solution and then proving that it is correct.
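The maximum entropy property just derived can be checked numerically. The sketch below (with an arbitrary illustrative σ) uses the standard closed-form differential entropies of three families, each parameterized to have the same variance σ²; the Gaussian's entropy is the largest, as the derivation predicts:

```python
import math

# Arbitrary common standard deviation; each distribution below is
# parameterized so that its variance equals sigma**2.
sigma = 1.5

# Standard closed-form differential entropies (in nats):
h_gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
# Laplace with variance sigma^2 has scale b = sigma/sqrt(2); entropy = 1 + log(2b).
h_laplace = 1 + math.log(2 * sigma / math.sqrt(2))
# Uniform with variance sigma^2 has width sigma*sqrt(12); entropy = log(width).
h_uniform = math.log(sigma * math.sqrt(12))

# The Gaussian attains the maximum entropy for fixed variance.
assert h_gaussian > h_laplace and h_gaussian > h_uniform
```

The same ordering holds for any σ, since each of the three entropies differs from log σ only by a constant.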
19.4.3  Continuous Latent Variables
When our graphical model contains continuous latent variables, we may still perform variational inference and learning by maximizing L. However, we must now use calculus of variations when maximizing L with respect to q(h | v).

In most cases, practitioners need not solve any calculus of variations problems themselves. Instead, there is a general equation for the mean field fixed point updates. If we make the mean field approximation

q(h | v) = ∏ᵢ q(hᵢ | v),   (19.55)

and fix q(hⱼ | v) for all j ≠ i, then the optimal q(hᵢ | v) may be obtained by normalizing the unnormalized distribution

q̃(hᵢ | v) = exp( E_{h₋ᵢ∼q(h₋ᵢ|v)} log p̃(v, h) ),   (19.56)

so long as p does not assign 0 probability to any joint configuration of variables. Carrying out the expectation inside the equation will yield the correct functional form of q(hᵢ | v). It is only necessary to derive functional forms of q directly using calculus of variations if one wishes to develop a new form of variational learning; Eq. 19.56 yields the mean field approximation for any probabilistic model.

Eq. 19.56 is a fixed point equation, designed to be iteratively applied for each value of i repeatedly until convergence. However, it also tells us more than that. It tells us the functional form that the optimal solution will take, whether we arrive there by fixed point equations or not. This means we can take the functional form from that equation but regard some of the values that appear in it as parameters, which we can optimize with any optimization algorithm we like.

As an example, consider a very simple probabilistic model, with latent variables h ∈ ℝ² and just one visible variable, v. Suppose that p(h) = N(h; 0, I) and p(v | h) = N(v; w⊤h; 1). We could actually simplify this model by integrating out h; the result is just a Gaussian distribution over v. The model itself is not interesting; we have constructed it only to provide a simple demonstration of how calculus of variations may be applied to probabilistic modeling.
The true posterior is given, up to a normalizing constant, by

p(h | v)   (19.57)
∝ p(h, v)   (19.58)
= p(h₁)p(h₂)p(v | h)   (19.59)
∝ exp( −½ ( h₁² + h₂² + (v − h₁w₁ − h₂w₂)² ) )   (19.60)
= exp( −½ ( h₁² + h₂² + v² + h₁²w₁² + h₂²w₂² − 2vh₁w₁ − 2vh₂w₂ + 2h₁w₁h₂w₂ ) ).   (19.61)

Due to the presence of the terms multiplying h₁ and h₂ together, we can see that the true posterior does not factorize over h₁ and h₂.

Applying Eq. 19.56, we find that

q̃(h₁ | v)   (19.62)
= exp( E_{h₂∼q(h₂|v)} log p̃(v, h) )   (19.63)
= exp( −½ E_{h₂∼q(h₂|v)} [ h₁² + h₂² + v² + h₁²w₁² + h₂²w₂²   (19.64)
    − 2vh₁w₁ − 2vh₂w₂ + 2h₁w₁h₂w₂ ] ).   (19.65)

From this, we can see that there are effectively only two values we need to obtain from q(h₂ | v): E_{h₂∼q(h|v)}[h₂] and E_{h₂∼q(h|v)}[h₂²]. Writing these as ⟨h₂⟩ and ⟨h₂²⟩, we obtain

q̃(h₁ | v) = exp( −½ [ h₁² + ⟨h₂²⟩ + v² + h₁²w₁² + ⟨h₂²⟩w₂²   (19.66)
    − 2vh₁w₁ − 2v⟨h₂⟩w₂ + 2h₁w₁⟨h₂⟩w₂ ] ).   (19.67)

From this, we can see that q̃ has the functional form of a Gaussian. We can thus conclude q(h | v) = N(h; µ, β⁻¹), where µ and diagonal β are variational parameters that we can optimize using any technique we choose. It is important to recall that we did not ever assume that q would be Gaussian; its Gaussian form was derived automatically by using calculus of variations to maximize L with respect to q. Using the same approach on a different model could yield a different functional form of q.

This was, of course, just a small case constructed for demonstration purposes. For examples of real applications of variational learning with continuous variables in the context of deep learning, see Goodfellow et al. (2013d).
19.4.4  Interactions between Learning and Inference
Using approximate inference as part of a learning algorithm affects the learning process, and this in turn affects the accuracy of the inference algorithm.

Specifically, the training algorithm tends to adapt the model in a way that makes the approximating assumptions underlying the approximate inference algorithm become more true. When training the parameters, variational learning increases

E_{h∼q} log p(v, h).   (19.68)

For a specific v, this increases p(h | v) for values of h that have high probability under q(h | v) and decreases p(h | v) for values of h that have low probability under q(h | v).

This behavior causes our approximating assumptions to become self-fulfilling prophecies. If we train the model with a unimodal approximate posterior, we will obtain a model with a true posterior that is far closer to unimodal than we would have obtained by training the model with exact inference.
Computing the true amount of harm imposed on a model by a variational approximation is thus very difficult. There exist several methods for estimating log p(v). We often estimate log p(v; θ) after training the model, and find that the gap with L(v, θ, q) is small. From this, we can conclude that our variational approximation is accurate for the specific value of θ that we obtained from the learning process. We should not conclude that our variational approximation is accurate in general or that the variational approximation did little harm to the learning process. To measure the true amount of harm induced by the variational approximation, we would need to know θ* = arg max_θ log p(v; θ). It is possible for L(v, θ, q) ≈ log p(v; θ) and log p(v; θ) ≪ log p(v; θ*) to hold simultaneously. If max_q L(v, θ*, q) ≪ log p(v; θ*), because θ* induces too complicated of a posterior distribution for our q family to capture, then the learning process will never approach θ*. Such a problem is very difficult to detect, because we can only know for sure that it happened if we have a superior learning algorithm that can find θ* for comparison.
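The kind of diagnostic described above can be made concrete on a model small enough to enumerate. The sketch below (an arbitrary made-up joint over two binary latent variables and a fixed observed v, purely for illustration) computes log p(v) exactly, maximizes L over factorized q by a coarse grid search, and confirms both that L lower-bounds log p(v) and that a factorized family leaves a nonzero gap when the posterior is correlated:

```python
import math
from itertools import product

# Arbitrary illustrative values of p(h1, h2, v) for one fixed observed v.
# The posterior puts most mass on (0, 0) and (1, 1), so it is correlated.
p_hv = {(0, 0): 0.30, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.25}
log_p_v = math.log(sum(p_hv.values()))  # exact log p(v) by enumeration

def elbo(a, b):
    """L(v, q) for the factorized q(h) = Bernoulli(a) * Bernoulli(b)."""
    total = 0.0
    for h1, h2 in product([0, 1], repeat=2):
        q = (a if h1 else 1 - a) * (b if h2 else 1 - b)
        if q > 0:
            total += q * (math.log(p_hv[(h1, h2)]) - math.log(q))
    return total

# Coarse grid search over the factorized family.
best = max(elbo(i / 100, j / 100) for i in range(1, 100) for j in range(1, 100))
assert best <= log_p_v + 1e-12   # L never exceeds log p(v)
assert log_p_v - best > 1e-3     # the factorized family cannot close the gap here
```

The remaining gap is exactly KL(q(h) ‖ p(h | v)) at the best factorized q, which is positive here because no product distribution matches the correlated posterior.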
19.5  Learned Approximate Inference
We have seen that inference can be thought of as an optimization procedure that increases the value of a function L. Explicitly performing optimization via iterative procedures such as fixed point equations or gradient-based optimization is often very expensive and time-consuming. Many approaches to inference avoid
this expense by learning to perform approximate inference. Specifically, we can think of the optimization process as a function f that maps an input v to an approximate distribution q* = arg max_q L(v, q). Once we think of the multi-step iterative optimization process as just being a function, we can approximate it with a neural network that implements an approximation f̂(v; θ).
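A minimal sketch of this amortization idea, reusing the linear-Gaussian toy model of Sec. 19.4.3 (all constants are illustrative): because the optimal mean field means are a linear function of v, a one-weight-per-latent "inference network" μ̂ᵢ(v) = θᵢv trained by gradient descent to imitate the iterative optimizer recovers the closed-form map wᵢ/(1 + ‖w‖²), after which inference is a single multiplication rather than a fixed point iteration:

```python
import random

random.seed(0)
# Toy model from Sec. 19.4.3: p(h) = N(h; 0, I), p(v | h) = N(v; w^T h, 1).
w = [1.0, 0.5]                           # illustrative weights
norm = 1 + sum(wi * wi for wi in w)

def mean_field(v, n_iter=50):
    """Expensive iterative inference: the fixed point updates of Eq. 19.56."""
    mu = [0.0, 0.0]
    for _ in range(n_iter):
        mu[0] = w[0] * (v - w[1] * mu[1]) / (1 + w[0] ** 2)
        mu[1] = w[1] * (v - w[0] * mu[0]) / (1 + w[1] ** 2)
    return mu

# Train a tiny linear "inference network" mu_hat_i(v) = theta[i] * v to
# imitate the optimizer's output on sampled inputs v.
theta = [0.0, 0.0]
lr = 0.02
for _ in range(2000):
    v = random.gauss(0.0, 2.0)
    target = mean_field(v)
    for i in range(2):
        err = theta[i] * v - target[i]
        theta[i] -= lr * err * v         # gradient of 0.5 * err**2 w.r.t. theta[i]

# The learned map should recover the closed-form solution w_i / (1 + ||w||^2).
for i in range(2):
    assert abs(theta[i] - w[i] / norm) < 1e-2
```

Here the targets come from running the optimizer itself; the variational autoencoder discussed later avoids even that, adapting the inference network directly to increase L.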
19.5.1  Wake-Sleep
One of the main difficulties with training a model to infer h from v is that we do not have a supervised training set with which to train the model. Given a v, we do not know the appropriate h. The mapping from v to h depends on the choice of model family, and evolves throughout the learning process as θ changes. The wake-sleep algorithm (Hinton et al., 1995b; Frey et al., 1996) resolves this problem by drawing samples of both h and v from the model distribution. For example, in a directed model, this can be done cheaply by performing ancestral sampling beginning at h and ending at v. The inference network can then be trained to perform the reverse mapping: predicting which h caused the present v. The main drawback to this approach is that we will only be able to train the
inference network on values of v that have high probability under the model. Early in learning, the model distribution will not resemble the data distribution, so the inference network will not have an opportunity to learn on samples that resemble the data.

In Sec. 18.2 we saw that one possible explanation for the role of dream sleep in human beings and animals is that dreams could provide the negative phase samples that Monte Carlo training algorithms use to approximate the negative gradient of the log partition function of undirected models. Another possible explanation for biological dreaming is that it is providing samples from p(h, v) which can be used to train an inference network to predict h given v. In some senses, this explanation is more satisfying than the partition function explanation. Monte Carlo algorithms generally do not perform well if they are run using only the positive phase of the
Monte Carlo algorithms gradien gradientt for several steps then with only the negative phase of the gradient for generally do not perform welland if they are are run usually using only the positive of the sev several eral steps. Human beings animals awak ake e for severalphase consecutive gradien t forasleep several then with only the negative the gradient for hours then for steps several consecutive hours. It is not phase readilyofapparent how this sev eral steps. beings andCarlo animals are usually awake for several sc schedule hedule couldHuman supp support ort Monte training of an undirected mo model. del.consecutive Learning hours then asleep for several consecutive hours. It is not readily apparent how this algorithms based on maximizing L can be run with prolonged perio eriods ds of improving sc hedule could supp ort Monte Carlo training of an undirected mo del. Learning q and prolonged perio eriods ds of impro improving ving θ , how however. ever. If the role of biological dreaming algorithms based on maximizing can b e run with prolonged eriods ofare improving q is to train netw for predicting , then this explains ho able to networks orks how w panimals qremain θ and prolonged p erio ds of impro ving , how ever. If the role of biological dreaming L awak akee for several hours (the longer they are aw awak ak ake, e, the greater the gap q is to train netw orks for predicting , then this explains ho animals are able to bet etw ween L and log p (v ), but L will remain a low lower er bound) wand to remain asleep remain awake for several hours (the longer they are awake, the greater the gap 654 a lower b ound) and to remain asleep between and log p (v ), but will remain L L
for several hours (the generative model itself is not modified during sleep) without damaging their internal models. Of course, these ideas are purely speculative, and there is no hard evidence to suggest that dreaming accomplishes either of these goals. Dreaming may also serve reinforcement learning rather than probabilistic modeling, by sampling synthetic experiences from the animal's transition model, on which to train the animal's policy. Or sleep may serve some other purpose not yet anticipated by the machine learning community.
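Setting the biological speculation aside, the sleep phase itself is simple to sketch. Below, for a made-up directed model with one binary latent and one binary visible unit (all probabilities illustrative, not from the text), we dream (h, v) pairs by ancestral sampling and fit the recognition model on them; with a single binary input, the maximum likelihood recognition model is just a conditional frequency table, the simplest possible inference network:

```python
import random

random.seed(0)

# Illustrative directed model: h ~ Bernoulli(0.5),
# v | h ~ Bernoulli(0.8 if h == 1 else 0.2).
def ancestral_sample():
    h = 1 if random.random() < 0.5 else 0
    v = 1 if random.random() < (0.8 if h == 1 else 0.2) else 0
    return h, v

# Sleep phase: dream up (h, v) pairs from the model, then fit the
# recognition model q(h | v) on them by conditional frequency counting.
counts = {0: [0, 0], 1: [0, 0]}  # counts[v] = [# of h=0, # of h=1]
for _ in range(50000):
    h, v = ancestral_sample()
    counts[v][h] += 1

q1 = counts[1][1] / sum(counts[1])  # learned q(h = 1 | v = 1)
q0 = counts[0][1] / sum(counts[0])  # learned q(h = 1 | v = 0)

# Bayes' rule gives the true posteriors p(h=1 | v=1) = 0.8 and
# p(h=1 | v=0) = 0.2, which the dream-trained model approaches.
assert abs(q1 - 0.8) < 0.02 and abs(q0 - 0.2) < 0.02
```

The drawback discussed above also shows up here: the recognition model is only ever fit on values of v that the generative model itself produces.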
19.5.2  Other Forms of Learned Inference
This strategy of learned approximate inference has also been applied to other models. Salakhutdinov and Larochelle (2010) showed that a single pass in a learned inference network could yield faster inference than iterating the mean field fixed point equations in a DBM. The training procedure is based on running the inference network, then applying one step of mean field to improve its estimates, and training the inference network to output this refined estimate instead of its original estimate.

We have already seen in Sec. 14.8 that the predictive sparse decomposition model trains a shallow encoder network to predict a sparse code for the input. This can be seen as a hybrid between an autoencoder and sparse coding. It is possible to devise probabilistic semantics for the model, under which the encoder
Due to its possible to devise probabilistic semantics for thethe model, which the benco der shallo shallow w enco encoder, der, PSD is not able to implement kindunder of comp competition etition etw etween een ma y b e viewed as p erforming learned approximate MAP inference. Due to units that we hav havee seen in mean field inference. Ho How wev ever, er, that problem can its be shallow enco der, PSDa is notenco ableder to to implement the kindappro of comp etition between remedied by training deep encoder perform learned approximate ximate inference, as units that w e hav e seen in mean field inference. Ho w ev er, that problem can b e in the IST ISTA A tec technique hnique (Gregor and LeCun, 2010b). remedied by training a deep encoder to perform learned approximate inference, as Learned appro approximate ximate inference has recen recently tly become one of the dominan dominantt in the ISTA technique (Gregor and LeCun, 2010b). approac approaches hes to generativ generativee mo modeling, deling, in the form of the variational auto autoenco enco encoder der Learned appro ximate recen tly become one of the dominan (Kingma , 2013 ; Rezende et inference al., 2014). has In this elegan elegant t approac approach, h, there is no need tot approaches to generativ mothe deling, in the form the variational autoenco der construct explicit targetsefor inference netw network. ork.ofInstead, the inference netw network ork (isKingma , 2013 ; Rezende et al. , 2014 ). In this elegan t approac h, there is no need to simply used to define Lelegan elegantt approach, there is no need the inference netw network ork construct explicit targetsL.for the mo inference network. Instead, inference ork. are adapted to increase This model del is describ described ed in depth the later, in Sec. 
netw 20.10.3 is simply used to define elegant approach, there is no need the inference network Using appro approximate ximate inference, it is possible to train and use a wide variety of are adapted to increase L. This model is described in depth later, in Sec. 20.10.3. mo models. dels. Many of these mo models dels are describ described ed in the next chapter. L Using approximate inference, it is possible to train and use a wide variety of models. Many of these models are described in the next chapter.
Chapter 20
Deep Generative Models

In this chapter, we present several of the specific kinds of generative models that can be built and trained using the techniques presented in Chapters 16, 17, 18 and 19. All of these models represent probability distributions over multiple variables in some way. Some allow the probability distribution function to be evaluated explicitly. Others do not allow the evaluation of the probability distribution function, but support operations that implicitly require knowledge of it, such as drawing samples from the distribution. Some of these models are structured probabilistic models described in terms of graphs and factors, using the language of graphical models presented in Chapter 16. Others can not easily be described in terms of factors, but represent probability distributions nonetheless.
20.1  Boltzmann Machines
Boltzmann machines were originally introduced as a general "connectionist" approach to learning arbitrary probability distributions over binary vectors (Fahlman et al., 1983; Ackley et al., 1985; Hinton et al., 1984; Hinton and Sejnowski, 1986). Variants of the Boltzmann machine that include other kinds of variables have long ago surpassed the popularity of the original. In this section we briefly introduce the binary Boltzmann machine and discuss the issues that come up when trying to train and perform inference in the model.

We define the Boltzmann machine over a d-dimensional binary random vector x ∈ {0, 1}ᵈ. The Boltzmann machine is an energy-based model (Sec. 16.2.4),
meaning we define the joint probability distribution using an energy function:

P(x) = exp(−E(x)) / Z,   (20.1)

where E(x) is the energy function and Z is the partition function that ensures that Σₓ P(x) = 1. The energy function of the Boltzmann machine is given by

E(x) = −x⊤Ux − b⊤x,   (20.2)

where U is the "weight" matrix of model parameters and b is the vector of bias parameters.

In the general setting of the Boltzmann machine, we are given a set of training examples, each of which are n-dimensional. Eq. 20.1 describes the joint probability distribution over the observed variables. While this scenario is certainly viable, it does limit the kinds of interactions between the observed variables to those described by the weight matrix. Specifically, it means that the probability of one unit being on is given by a linear model (logistic regression) from the values of the other units.
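For small d, Eqs. 20.1-20.2 can be verified by brute force, since Z is a sum over only 2ᵈ states; the same enumeration also confirms the logistic regression observation. A sketch with arbitrary illustrative parameters:

```python
import math
from itertools import product

# Arbitrary illustrative parameters for a d = 3 Boltzmann machine.
U = [[0.0, 0.5, -0.3],
     [0.0, 0.0, 0.8],
     [0.0, 0.0, 0.0]]   # weights; the energy uses x^T U x
b = [0.1, -0.2, 0.3]    # biases

def energy(x):
    # E(x) = -x^T U x - b^T x   (Eq. 20.2)
    quad = sum(U[i][j] * x[i] * x[j] for i in range(3) for j in range(3))
    return -quad - sum(bi * xi for bi, xi in zip(b, x))

states = list(product([0, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in states)
P = {x: math.exp(-energy(x)) / Z for x in states}
assert abs(sum(P.values()) - 1.0) < 1e-12   # Eq. 20.1 normalizes

# The conditional of one unit given the rest is logistic regression:
# P(x_i = 1 | x_-i) = sigmoid(b_i + U_ii + sum_{j != i} (U_ij + U_ji) x_j).
i, rest = 0, (1, 1)     # condition on x_1 = 1, x_2 = 1
p_on = P[(1,) + rest] / (P[(0,) + rest] + P[(1,) + rest])
logit = b[i] + U[i][i] + sum((U[i][j] + U[j][i]) * rest[j - 1] for j in (1, 2))
assert abs(p_on - 1 / (1 + math.exp(-logit))) < 1e-12
```

The conditional check is an exact algebraic identity, which is what makes Gibbs-style updates of one unit at a time cheap in Boltzmann machines.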
The Boltzmann machine becomes more powerful when not all the variables are observed. In this case, the non-observed variables, or latent variables, can act similarly to hidden units in a multi-layer perceptron and model higher-order interactions among the visible units. Just as the addition of hidden units to convert logistic regression into an MLP results in the MLP being a universal approximator of functions, a Boltzmann machine with hidden units is no longer limited to modeling linear relationships between variables. Instead, the Boltzmann machine becomes a universal approximator of probability mass functions over discrete variables (Le Roux and Bengio, 2008).

Formally, we decompose the units x into two subsets: the visible units v and the latent (or hidden) units h. The energy function becomes

    E(v, h) = −v⊤Rv − v⊤Wh − h⊤Sh − b⊤v − c⊤h.    (20.3)
Boltzmann Machine Learning  Learning algorithms for Boltzmann machines are usually based on maximum likelihood. All Boltzmann machines have an intractable partition function, so the maximum likelihood gradient must be approximated using the techniques described in Chapter 18.

One interesting property of Boltzmann machines when trained with learning rules based on maximum likelihood is that the update for a particular weight connecting two units depends only on the statistics of those two units, collected
under different distributions: P_model(v) and P̂_data(v)P_model(h | v). The rest of the network participates in shaping those statistics, but the weight can be updated without knowing anything about the rest of the network or how those statistics were produced. This means that the learning rule is "local," which makes Boltzmann machine learning somewhat biologically plausible. It is conceivable that if each neuron were a random variable in a Boltzmann machine, then the axons and dendrites connecting two random variables could learn only by observing the firing pattern of the cells that they actually physically touch. In particular, in the positive phase, two units that frequently activate together have their connection strengthened.
This is an example of a Hebbian learning rule (Hebb, 1949) often summarized with the mnemonic "fire together, wire together." Hebbian learning rules are among the oldest hypothesized explanations for learning in biological systems and remain relevant today (Giudice et al., 2009).

Other learning algorithms that use more information than local statistics seem to require us to hypothesize the existence of more machinery than this. For example, for the brain to implement back-propagation in a multilayer perceptron, it seems necessary for the brain to maintain a secondary communication network for transmitting gradient information backwards through the network. Proposals for biologically plausible implementations (and approximations) of back-propagation have been made (Hinton, 2007a; Bengio, 2015) but remain to be validated, and Bengio (2015) links back-propagation of gradients to inference in energy-based models similar to the Boltzmann machine (but with continuous latent variables).
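The locality of the weight update can be made concrete: the maximum-likelihood gradient on the weight connecting units i and j involves only the correlation of those two units under the data distribution (positive phase) and under the model distribution (negative phase). The sketch below uses toy placeholder samples; in practice the model-distribution samples would come from MCMC, as described in Chapter 18.

```python
import numpy as np

# Illustrative sketch (not the book's code) of the local Boltzmann machine
# weight update: delta_U[i, j] is proportional to
# E_data[x_i x_j] - E_model[x_i x_j].
def weight_update(data_samples, model_samples, lr=0.01):
    pos = np.mean([np.outer(x, x) for x in data_samples], axis=0)   # positive phase
    neg = np.mean([np.outer(x, x) for x in model_samples], axis=0)  # negative phase
    return lr * (pos - neg)

data = [np.array([1, 1, 0]), np.array([1, 1, 1])]    # toy "data" samples
model = [np.array([0, 1, 0]), np.array([1, 0, 1])]   # toy "model" samples
dU = weight_update(data, model)
# Units 0 and 1 fire together in every data sample but never in the model
# samples, so their connection is strengthened: "fire together, wire together."
assert dU[0, 1] > 0
```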
The negative phase of Boltzmann machine learning is somewhat harder to explain from a biological point of view. As argued in Sec. 18.2, dream sleep may be a form of negative phase sampling. This idea is more speculative though.
20.2 Restricted Boltzmann Machines

Invented under the name harmonium (Smolensky, 1986), restricted Boltzmann machines are some of the most common building blocks of deep probabilistic models. We have briefly described RBMs previously, in Sec. 16.7.1. Here we review the previous information and go into more detail. RBMs are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables. RBMs may be stacked (one on top of the other) to form deeper models. See Fig. 20.1 for some examples. In particular, Fig. 20.1a shows the graph structure of the RBM itself. It is a bipartite graph, with no connections permitted between any variables in the observed layer or between any units in the latent layer.
Figure 20.1: Examples of models that may be built with restricted Boltzmann machines. (a) The restricted Boltzmann machine itself is an undirected graphical model based on a bipartite graph, with visible units in one part of the graph and hidden units in the other part. There are no connections among the visible units, nor any connections among the hidden units. Typically every visible unit is connected to every hidden unit, but it is possible to construct sparsely connected RBMs such as convolutional RBMs. (b) A deep belief network is a hybrid graphical model involving both directed and undirected connections. Like an RBM, it has no intra-layer connections. However, a DBN has multiple hidden layers, and thus there are connections between hidden units that are in separate layers. All of the local conditional probability distributions needed by the deep belief network are copied directly from the local conditional probability distributions of its constituent RBMs. Alternatively, we could also represent the deep belief network with a completely undirected graph, but it would need intra-layer connections to capture the dependencies between parents. (c) A deep Boltzmann machine is an undirected graphical model with several layers of latent variables. Like RBMs and DBNs, DBMs lack intra-layer connections. DBMs are less closely tied to RBMs than DBNs are. When initializing a DBM from a stack of RBMs, it is necessary to modify the RBM parameters slightly. Some kinds of DBMs may be trained without first training a set of RBMs.
We begin with the binary version of the restricted Boltzmann machine, but as we see later there are extensions to other types of visible and hidden units.

More formally, let the observed layer consist of a set of n_v binary random variables which we refer to collectively with the vector v. We refer to the latent or hidden layer of n_h binary random variables as h.

Like the general Boltzmann machine, the restricted Boltzmann machine is an energy-based model with the joint probability distribution specified by its energy function:

    P(v = v, h = h) = (1/Z) exp(−E(v, h)).    (20.4)

The energy function for an RBM is given by

    E(v, h) = −b⊤v − c⊤h − v⊤Wh,    (20.5)

and Z is the normalizing constant known as the partition function:

    Z = Σ_v Σ_h exp{−E(v, h)}.    (20.6)

It is apparent from the definition of the partition function Z that the naive method of computing Z (exhaustively summing over all states) could be computationally intractable, unless a cleverly designed algorithm could exploit regularities in the probability distribution to compute Z faster. In the case of restricted Boltzmann machines, Long and Servedio (2010) formally proved that the partition function Z is intractable. The intractable partition function Z implies that the normalized joint probability distribution P(v) is also intractable to evaluate.
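The naive computation of Z in Eq. 20.6 can be written down directly, which also makes its exponential cost visible: the double sum visits 2^(n_v + n_h) states. The sketch below (with illustrative names and toy sizes of our own choosing) is only feasible because the model is tiny.

```python
import numpy as np
from itertools import product

def rbm_energy(v, h, W, b, c):
    """RBM energy of Eq. 20.5: E(v, h) = -b^T v - c^T h - v^T W h."""
    return -b @ v - c @ h - v @ W @ h

def naive_partition(W, b, c):
    """Exhaustive sum of Eq. 20.6: 2^n_v * 2^n_h terms."""
    n_v, n_h = W.shape
    Z = 0.0
    for v in product([0, 1], repeat=n_v):       # 2^n_v visible states...
        for h in product([0, 1], repeat=n_h):   # ...times 2^n_h hidden states
            Z += np.exp(-rbm_energy(np.array(v), np.array(h), W, b, c))
    return Z

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.normal(scale=0.1, size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
Z = naive_partition(W, b, c)   # already 2^5 = 32 terms; grows exponentially
```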
20.2.1 Conditional Distributions

Though P(v) is intractable, the bipartite graph structure of the RBM has the very special property that its conditional distributions P(h | v) and P(v | h) are factorial and relatively simple to compute and to sample from.

Deriving the conditional distributions from the joint distribution is straightforward:

    P(h | v) = P(h, v) / P(v)    (20.7)
             = (1 / P(v)) (1/Z) exp{b⊤v + c⊤h + v⊤Wh}    (20.8)
             = (1/Z′) exp{c⊤h + v⊤Wh}    (20.9)
             = (1/Z′) exp{ Σ_{j=1}^{n_h} c_j h_j + Σ_{j=1}^{n_h} v⊤W_{:,j} h_j }    (20.10)
             = (1/Z′) Π_{j=1}^{n_h} exp{ c_j h_j + v⊤W_{:,j} h_j }.    (20.11)

Since we are conditioning on the visible units v, we can treat these as constant with respect to the distribution P(h | v). The factorial nature of the conditional P(h | v) follows immediately from our ability to write the joint probability over the vector h as the product of (unnormalized) distributions over the individual elements, h_j. It is now a simple matter of normalizing the distributions over the individual binary h_j:

    P(h_j = 1 | v) = P̃(h_j = 1 | v) / (P̃(h_j = 0 | v) + P̃(h_j = 1 | v))    (20.12)
                   = exp{c_j + v⊤W_{:,j}} / (exp{0} + exp{c_j + v⊤W_{:,j}})    (20.13)
                   = σ(c_j + v⊤W_{:,j}).    (20.14)

We can now express the full conditional over the hidden layer as the factorial distribution:

    P(h | v) = Π_{j=1}^{n_h} σ((2h − 1) ⊙ (c + W⊤v))_j.    (20.15)
P (h v) = A similar deriv derivation ation |will show that the−othercondition of interest to us, P ( v | h), is also a factorial distribution: A similar derivation will show that the other condition of interest to us, P ( v h), nv Y is also a factorial distribution:Y | (20.16) P (v | h) = σ ((2v − 1) (b + W h))i . =
20.2.2 Training Restricted Boltzmann Machines
Because the RBM admits efficient evaluation and differentiation of P̃(v) and efficient MCMC sampling in the form of block Gibbs sampling, it can readily be trained with any of the techniques described in Chapter 18 for training models that have intractable partition functions. This includes CD, SML (PCD), ratio matching and so on. Compared to other undirected models used in deep learning, the RBM is relatively straightforward to train because we can compute P(h | v)
exactly in closed form. Some other deep models, such as the deep Boltzmann machine, combine both the difficulty of an intractable partition function and the difficulty of intractable inference.
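Of the training techniques listed above, CD with a single Gibbs step (CD-1) is the simplest to sketch. The following toy implementation is our own illustration of the update rule, not a reference implementation; names, sizes, and the learning rate are arbitrary choices.

```python
import numpy as np

# Minimal CD-1 (contrastive divergence with one block Gibbs step) sketch
# for a binary RBM, using the factorial conditionals of Eqs. 20.14 / 20.16.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, rng, lr=0.1):
    ph0 = sigmoid(c + v0 @ W)                     # positive phase: P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0) * 1.0      # sample h ~ P(h | v0)
    v1 = (rng.random(b.shape) < sigmoid(b + W @ h0)) * 1.0   # reconstruct v
    ph1 = sigmoid(c + v1 @ W)                     # negative phase: P(h=1 | v1)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 3))
b, c = np.zeros(4), np.zeros(3)
v0 = np.array([1.0, 1.0, 0.0, 0.0])
for _ in range(200):                              # fit a single toy pattern
    cd1_update(v0, W, b, c, rng)
```

After these updates the visible biases shift toward the training pattern: the biases of the "on" units grow relative to the biases of the "off" units.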
20.3 Deep Belief Networks
Deep belief networks (DBNs) were one of the first non-convolutional models to successfully admit training of deep architectures (Hinton et al., 2006; Hinton, 2007b). The introduction of deep belief networks in 2006 began the current deep learning renaissance. Prior to the introduction of deep belief networks, deep models were considered too difficult to optimize. Kernel machines with convex objective functions dominated the research landscape. Deep belief networks demonstrated that deep architectures can be successful, by outperforming kernelized support vector machines on the MNIST dataset (Hinton et al., 2006). Today, deep belief networks have mostly fallen out of favor and are rarely used, even compared to other unsupervised or generative learning algorithms, but they are still deservedly recognized for their important role in deep learning history.
Deep belief networks are generative models with several layers of latent variables. The latent variables are typically binary, while the visible units may be binary or real. There are no intra-layer connections. Usually, every unit in each layer is connected to every unit in each neighboring layer, though it is possible to construct more sparsely connected DBNs. The connections between the top two layers are undirected. The connections between all other layers are directed, with the arrows pointed toward the layer that is closest to the data. See Fig. 20.1b for an example.

A DBN with l hidden layers contains l weight matrices: W^(1), …, W^(l). It also contains l + 1 bias vectors: b^(0), …, b^(l), with b^(0) providing the biases for the visible layer. The probability distribution represented by the DBN is given by
    P(h^(l), h^(l−1)) ∝ exp(b^(l)⊤ h^(l) + b^(l−1)⊤ h^(l−1) + h^(l−1)⊤ W^(l) h^(l)),    (20.17)

    P(h_i^(k) = 1 | h^(k+1)) = σ(b_i^(k) + W_{:,i}^{(k+1)⊤} h^(k+1))  ∀i, ∀k ∈ 1, …, l − 2,    (20.18)

    P(v_i = 1 | h^(1)) = σ(b_i^(0) + W_{:,i}^{(1)⊤} h^(1))  ∀i.    (20.19)

In the case of real-valued visible units, substitute

    v ∼ N(v; b^(0) + W^(1)⊤ h^(1), β^−1),    (20.20)
with β diagonal for tractability. Generalizations to other exponential family visible units are straightforward, at least in theory. A DBN with only one hidden layer is just an RBM.

To generate a sample from a DBN, we first run several steps of Gibbs sampling on the top two hidden layers. This stage is essentially drawing a sample from the RBM defined by the top two hidden layers. We can then use a single pass of ancestral sampling through the rest of the model to draw a sample from the visible units.

Deep belief networks incur many of the problems associated with both directed models and undirected models.

Inference in a deep belief network is intractable due to the explaining away effect within each directed layer, and due to the interaction between the two hidden layers that have undirected connections.
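The two-stage sampling procedure described above (Gibbs sampling in the top RBM, then one ancestral pass down through the directed connections) can be sketched for a hypothetical DBN with two hidden layers. All parameter names and sizes below are illustrative assumptions, not from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_dbn(W1, W2, b0, b1, b2, n_gibbs, rng):
    """Draw one visible sample from a DBN with hidden layers h1, h2."""
    n_h1, n_h2 = W2.shape
    h1 = (rng.random(n_h1) < 0.5) * 1.0           # arbitrary initialization
    h2 = (rng.random(n_h2) < 0.5) * 1.0
    for _ in range(n_gibbs):                      # block Gibbs in the top RBM
        h1 = (rng.random(n_h1) < sigmoid(b1 + W2 @ h2)) * 1.0
        h2 = (rng.random(n_h2) < sigmoid(b2 + h1 @ W2)) * 1.0
    pv = sigmoid(b0 + W1 @ h1)                    # directed pass (Eq. 20.19)
    return (rng.random(len(b0)) < pv) * 1.0       # single ancestral sample

rng = np.random.default_rng(0)
n_v, n_h1, n_h2 = 5, 4, 3
W1 = rng.normal(size=(n_v, n_h1))
W2 = rng.normal(size=(n_h1, n_h2))
v = sample_dbn(W1, W2, np.zeros(n_v), np.zeros(n_h1), np.zeros(n_h2), 50, rng)
```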
Evaluating or maximizing the standard evidence lower bound on the log-likelihood is also intractable, because the evidence lower bound takes the expectation of cliques whose size is equal to the network width.

Evaluating or maximizing the log-likelihood requires not just confronting the problem of intractable inference to marginalize out the latent variables, but also the problem of an intractable partition function within the undirected model of the top two layers.

To train a deep belief network, one begins by training an RBM to maximize E_{v∼p_data} log p(v) using contrastive divergence or stochastic maximum likelihood. The parameters of the RBM then define the parameters of the first layer of the DBN. Next, a second RBM is trained to approximately maximize

    E_{v∼p_data} E_{h^(1)∼p^(1)(h^(1)|v)} log p^(2)(h^(1)),    (20.21)
where p^(1) is the probability distribution represented by the first RBM and p^(2) is the probability distribution represented by the second RBM. In other words, the second RBM is trained to model the distribution defined by sampling the hidden units of the first RBM, when the first RBM is driven by the data. This procedure can be repeated indefinitely, to add as many layers to the DBN as desired, with each new RBM modeling the samples of the previous one. Each RBM defines another layer of the DBN. This procedure can be justified as increasing a variational lower bound on the log-likelihood of the data under the DBN (Hinton et al., 2006).

In most applications, no effort is made to jointly train the DBN after the greedy layer-wise procedure is complete. However, it is possible to perform generative fine-tuning using the wake-sleep algorithm.
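The greedy layer-wise procedure can be sketched at a high level: train an RBM on the data, drive it with the data to produce hidden activities, train the next RBM on those, and repeat. The CD-1 trainer inside is a deliberately crude stand-in (our own toy choices of epochs, learning rate, and passing hidden probabilities rather than samples upward, a common practical shortcut), not a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=50, lr=0.1):
    """Toy CD-1 trainer; returns (W, b, c) for one RBM layer."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(c + v0 @ W)
            h0 = (rng.random(n_hidden) < ph0) * 1.0
            v1 = (rng.random(n_visible) < sigmoid(b + W @ h0)) * 1.0
            ph1 = sigmoid(c + v1 @ W)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
            b += lr * (v0 - v1)
            c += lr * (ph0 - ph1)
    return W, b, c

def train_dbn(data, layer_sizes, rng):
    """Greedy layer-wise stacking: each RBM models the previous one's
    hidden activities when driven by the data (cf. Eq. 20.21)."""
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden, rng)
        rbms.append((W, b, c))
        layer_input = sigmoid(c + layer_input @ W)   # pass activities up
    return rbms

rng = np.random.default_rng(0)
data = np.array([[1., 1., 0., 0.], [0., 0., 1., 1.]] * 4)
rbms = train_dbn(data, [3, 2], rng)
```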
The trained DBN may be used directly as a generative model, but most of the interest in DBNs arose from their ability to improve classification models. We can take the weights from the DBN and use them to define an MLP:

    h^(1) = σ(b^(1) + v⊤W^(1)),    (20.22)

    h^(l) = σ(b^(l) + h^(l−1)⊤W^(l))  ∀l ∈ 2, …, m.    (20.23)
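Eqs. 20.22 and 20.23 amount to an ordinary deterministic feedforward pass through the DBN's weights. A minimal sketch, where `weights` is an assumed list of (W, b) pairs, one per layer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_to_mlp_forward(v, weights):
    """Eq. 20.22 then Eq. 20.23: h = sigmoid(b + h_prev^T W) per layer."""
    h = v
    for W, b in weights:
        h = sigmoid(b + h @ W)
    return h

rng = np.random.default_rng(0)
weights = [(rng.normal(size=(6, 4)), np.zeros(4)),   # illustrative two-layer DBN
           (rng.normal(size=(4, 3)), np.zeros(3))]
out = dbn_to_mlp_forward(rng.random(6), weights)     # activations of top layer
```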
After initializing this the weigh weights via generativ generative 2, . . . ,learned m, (20.23)e h MLP = σ bwith +h Wts andl biases training of the DBN, we ma may y train the ∈ a classification task. This MLP to p∀erform After initializing this MLP with the weigh ts and biases learned via generative additional training of the MLP is an example of discriminativ discriminative e fine-tuning. training of the DBN, we may train the MLP to perform a classification task. This This sp specific ecific choice of MLP is somewhat arbitrary arbitrary, , compared to many of the is discriminativ additional training of the MLP an example of e fine-tuning. inference equations in Chapter 19 that are deriv derived ed from first principles. This MLP specific choice of seems MLP is , compared to many of the is aThis heuristic choice that to somewhat work well arbitrary in practice and is used consistently inference equationsMany in Chapter 19 thatinference are derivtechniques ed from firstare principles. This in the literature. approximate motiv motivated ated by MLP their is a heuristic choice that seems to work well in practice and is used consistently abilit ability y to find a maximally tight variational lo low wer bound on the log-likelihoo log-likelihood d in the some literature. approximate are low motiv ated byontheir under set of Many constraints. One caninference constructtechniques a variational lower er bound the abilit yeliho to find a maximally tight lowdefined er bound log-likelihoo d log-lik log-likeliho elihoo od using the hidden unitvariational exp expectations ectations by on thethe DBN’s MLP MLP,, but underis some setany of constraints. can construct a vhidden ariational lower bound this true of probability One distribution ov over er the units, and thereonis the no log-lik eliho o d using the hidden unit exp ectations defined by the DBN’s MLP , but reason to believe that this MLP provides a particularly tight bound. 
In particular, the MLP ignores many important interactions in the DBN graphical model. The MLP propagates information upward from the visible units to the deepest hidden units, but does not propagate any information downward or sideways. The DBN graphical model has explaining away interactions between all of the hidden units within the same layer as well as top-down interactions between layers.

While the log-likelihood of a DBN is intractable, it may be approximated with AIS (Salakhutdinov and Murray, 2008). This permits evaluating its quality as a generative model.

The term “deep belief network” is commonly used incorrectly to refer to any kind of deep neural network, even networks without latent variable semantics.
The term “deep belief network” should refer specifically to models with undirected connections in the deepest layer and directed connections pointing downward between all other pairs of consecutive layers.

The term “deep belief network” may also cause some confusion because the term “belief network” is sometimes used to refer to purely directed models, while deep belief networks contain an undirected layer. Deep belief networks also share the acronym DBN with dynamic Bayesian networks (Dean and Kanazawa, 1989), which are Bayesian networks for representing Markov chains.
CHAPTER 20. DEEP GENERATIVE MODELS
Figure 20.2: The graphical model for a deep Boltzmann machine with one visible layer (bottom) and two hidden layers. Connections are only between units in neighboring layers. There are no intra-layer connections.
20.4 Deep Boltzmann Machines
A deep Boltzmann machine or DBM (Salakhutdinov and Hinton, 2009a) is another kind of deep, generative model. Unlike the deep belief network (DBN), it is an entirely undirected model. Unlike the RBM, the DBM has several layers of latent variables (RBMs have just one). But like the RBM, within each layer, each of the variables are mutually independent, conditioned on the variables in the neighboring layers. See Fig. 20.2 for the graph structure. Deep Boltzmann machines have been applied to a variety of tasks including document modeling (Srivastava et al., 2013).

Like RBMs and DBNs, DBMs typically contain only binary units—as we assume for simplicity of our presentation of the model—but it is straightforward to include real-valued visible units.

A DBM is an energy-based model, meaning that the joint probability distribution over the model variables is parametrized by an energy function E. In the case of a deep Boltzmann machine with one visible layer, v, and three hidden layers, h^{(1)}, h^{(2)} and h^{(3)}, the joint probability is given by:

P(v, h^{(1)}, h^{(2)}, h^{(3)}) = (1/Z(θ)) exp(−E(v, h^{(1)}, h^{(2)}, h^{(3)}; θ)).   (20.24)

To simplify our presentation, we omit the bias parameters below. The DBM energy function is then defined as follows:

E(v, h^{(1)}, h^{(2)}, h^{(3)}; θ) = −v^⊤ W^{(1)} h^{(1)} − h^{(1)⊤} W^{(2)} h^{(2)} − h^{(2)⊤} W^{(3)} h^{(3)}.   (20.25)
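For concreteness, a minimal NumPy sketch (our illustration, not the book's code) of the energy in Eq. 20.25 and the corresponding unnormalized probability from Eq. 20.24; the weight-matrix shapes are our own convention:

```python
import numpy as np

def dbm_energy(v, h1, h2, h3, W1, W2, W3):
    """Energy of a three-hidden-layer DBM (Eq. 20.25, biases omitted).

    W1: (n_v, n_h1), W2: (n_h1, n_h2), W3: (n_h2, n_h3).
    """
    return -(v @ W1 @ h1) - (h1 @ W2 @ h2) - (h2 @ W3 @ h3)

def unnormalized_prob(v, h1, h2, h3, W1, W2, W3):
    """exp(-E); dividing by the partition function Z(theta) gives Eq. 20.24."""
    return np.exp(-dbm_energy(v, h1, h2, h3, W1, W2, W3))
```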
Figure 20.3: A deep Boltzmann machine, re-arranged to reveal its bipartite graph structure.
In comparison to the RBM energy function (Eq. 20.5), the DBM energy function includes connections between the hidden units (latent variables) in the form of the weight matrices (W^{(2)} and W^{(3)}). As we will see, these connections have significant consequences for both the model behavior as well as how we go about performing inference in the model.

In comparison to fully connected Boltzmann machines (with every unit connected to every other unit), the DBM offers some advantages that are similar to those offered by the RBM. Specifically, as illustrated in Fig. 20.3, the DBM layers can be organized into a bipartite graph, with odd layers on one side and even layers on the other. This immediately implies that when we condition on the variables in the even layers, the variables in the odd layers become conditionally independent. Of course, when we condition on the variables in the odd layers, the variables in the even layers also become conditionally independent.

The bipartite structure of the DBM means that we can apply the same equations we have previously used for the conditional distributions of an RBM to determine the conditional distributions in a DBM. The units within a layer are conditionally independent from each other given the values of the neighboring layers, so the distributions over binary variables can be fully described by the Bernoulli parameters giving the probability of each unit being active. In our example with two hidden layers, the activation probabilities are given by:

P(v_i = 1 | h^{(1)}) = σ(W_{i,:}^{(1)} h^{(1)}),   (20.26)
and

P(h_i^{(1)} = 1 | v, h^{(2)}) = σ(v^⊤ W_{:,i}^{(1)} + W_{i,:}^{(2)} h^{(2)}),   (20.27)
P(h_k^{(2)} = 1 | h^{(1)}) = σ(h^{(1)⊤} W_{:,k}^{(2)}).   (20.28)

The bipartite structure makes Gibbs sampling in a deep Boltzmann machine efficient. The naive approach to Gibbs sampling is to update only one variable at a time. RBMs allow all of the visible units to be updated in one block and all of the hidden units to be updated in a second block. One might naively assume that a DBM with l layers requires l + 1 updates, with each iteration updating a block consisting of one layer of units. Instead, it is possible to update all of the units in only two iterations. Gibbs sampling can be divided into two blocks of updates, one including all even layers (including the visible layer) and the other including all odd layers. Due to the bipartite DBM connection pattern, given the even layers, the distribution over the odd layers is factorial and thus can be sampled simultaneously and independently as a block. Likewise, given the odd layers, the even layers can be sampled simultaneously and independently as a block. Efficient sampling is especially important for training with the stochastic maximum likelihood algorithm.
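A minimal sketch of this two-block Gibbs sweep for the two-hidden-layer example, using the conditionals of Eqs. 20.26–20.28 (the shapes of `W1` and `W2` and the use of a NumPy `Generator` are our own conventions, not from the book):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(v, h1, h2, W1, W2, rng):
    """One two-block Gibbs sweep of a DBM with two hidden layers.

    W1: shape (n_v, n_h1), W2: shape (n_h1, n_h2); biases omitted.
    """
    # Odd block: sample h1 given (v, h2) -- Eq. 20.27.
    p_h1 = sigmoid(v @ W1 + W2 @ h2)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
    # Even block: v and h2 are conditionally independent given h1,
    # so they can be sampled simultaneously -- Eqs. 20.26 and 20.28.
    v = (rng.random(W1.shape[0]) < sigmoid(W1 @ h1)).astype(float)
    h2 = (rng.random(W2.shape[1]) < sigmoid(h1 @ W2)).astype(float)
    return v, h1, h2
```

Iterating this sweep realizes the "two iterations per full update" schedule described above.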
20.4.1 Interesting Properties

Deep Boltzmann machines have many interesting properties.

DBMs were developed after DBNs. Compared to DBNs, the posterior distribution P(h | v) is simpler for DBMs. Somewhat counterintuitively, the simplicity of this posterior distribution allows richer approximations of the posterior. In the case of the DBN, we perform classification using a heuristically motivated approximate inference procedure, in which we guess that a reasonable value for the mean field expectation of the hidden units can be provided by an upward pass through the network in an MLP that uses sigmoid activation functions and the same weights as the original DBN. Any distribution Q(h) may be used to obtain a variational lower bound on the log-likelihood. This heuristic procedure therefore allows us to obtain such a bound. However, the bound is not explicitly optimized in any way, so the bound may be far from tight. In particular, the heuristic estimate of Q ignores interactions between hidden units within the same layer as well as the top-down feedback influence of hidden units in deeper layers on hidden units that are closer to the input. Because the heuristic MLP-based inference procedure in the DBN is not able to account for these interactions, the resulting Q is presumably far
from optimal. In DBMs, all of the hidden units within a layer are conditionally independent given the other layers. This lack of intra-layer interaction makes it possible to use fixed point equations to actually optimize the variational lower bound and find the true optimal mean field expectations (to within some numerical tolerance).

The use of proper mean field allows the approximate inference procedure for DBMs to capture the influence of top-down feedback interactions. This makes DBMs interesting from the point of view of neuroscience, because the human brain is known to use many top-down feedback connections. Because of this property, DBMs have been used as computational models of real neuroscientific phenomena (Series et al., 2010; Reichert et al., 2011).

One unfortunate property of DBMs is that sampling from them is relatively difficult. DBNs only need to use MCMC sampling in their top pair of layers. The other layers are used only at the end of the sampling process, in one efficient ancestral sampling pass. To generate a sample from a DBM, it is necessary to use MCMC across all layers, with every layer of the model participating in every Markov chain transition.
20.4.2 DBM Mean Field Inference

The conditional distribution over one DBM layer given the neighboring layers is factorial. In the example of the DBM with two hidden layers, these distributions are P(v | h^{(1)}), P(h^{(1)} | v, h^{(2)}) and P(h^{(2)} | h^{(1)}). The distribution over all hidden layers generally does not factorize because of interactions between layers. In the example with two hidden layers, P(h^{(1)}, h^{(2)} | v) does not factorize due to the interaction weights W^{(2)} between h^{(1)} and h^{(2)}, which render these variables mutually dependent.

As was the case with the DBN, we are left to seek out methods to approximate the DBM posterior distribution. However, unlike the DBN, the DBM posterior distribution over the hidden units—while complicated—is easy to approximate with a variational approximation (as discussed in Sec. 19.4), specifically a mean field approximation. The mean field approximation is a simple form of variational inference, where we restrict the approximating distribution to fully factorial distributions. In the context of DBMs, the mean field equations capture the bidirectional interactions between layers. In this section we derive the iterative approximate inference procedure originally introduced in Salakhutdinov and Hinton (2009a).

In variational approximations to inference, we approach the task of approximating a particular target distribution—in our case, the posterior distribution over
the hidden units given the visible units—by some reasonably simple family of distributions. In the case of the mean field approximation, the approximating family is the set of distributions where the hidden units are conditionally independent.

We now develop the mean field approach for the example with two hidden layers. Let Q(h^{(1)}, h^{(2)} | v) be the approximation of P(h^{(1)}, h^{(2)} | v). The mean field assumption implies that

Q(h^{(1)}, h^{(2)} | v) = ∏_j Q(h_j^{(1)} | v) ∏_k Q(h_k^{(2)} | v).   (20.29)

The mean field approximation attempts to find a member of this family of distributions that best fits the true posterior P(h^{(1)}, h^{(2)} | v). Importantly, the inference process must be run again to find a different distribution Q every time we use a new value of v.

One can conceive of many ways of measuring how well Q(h | v) fits P(h | v). The mean field approach is to minimize

KL(Q‖P) = ∑_h Q(h^{(1)}, h^{(2)} | v) log ( Q(h^{(1)}, h^{(2)} | v) / P(h^{(1)}, h^{(2)} | v) ).   (20.30)

In general, we do not have to provide a parametric form of the approximating distribution beyond enforcing the independence assumptions. The variational approximation procedure is generally able to recover a functional form of the approximate distribution. However, in the case of a mean field assumption on binary hidden units (the case we are developing here) there is no loss of generality resulting from fixing a parametrization of the model in advance.

We parametrize Q as a product of Bernoulli distributions, that is we associate the probability of each element of h^{(1)} with a parameter. Specifically, for each j, ĥ_j^{(1)} = Q(h_j^{(1)} = 1 | v), where ĥ_j^{(1)} ∈ [0, 1], and for each k, ĥ_k^{(2)} = Q(h_k^{(2)} = 1 | v), where ĥ_k^{(2)} ∈ [0, 1]. Thus we have the following approximation to the posterior:

Q(h^{(1)}, h^{(2)} | v) = ∏_j Q(h_j^{(1)} | v) ∏_k Q(h_k^{(2)} | v)   (20.31)
= ∏_j (ĥ_j^{(1)})^{h_j^{(1)}} (1 − ĥ_j^{(1)})^{(1 − h_j^{(1)})} × ∏_k (ĥ_k^{(2)})^{h_k^{(2)}} (1 − ĥ_k^{(2)})^{(1 − h_k^{(2)})}.   (20.32)

Of course, for DBMs with more layers the approximate posterior parametrization can be extended in the obvious way, exploiting the bipartite structure of the graph
to update all of the even layers simultaneously and then to update all of the odd layers simultaneously, following the same schedule as Gibbs sampling.

Now that we have specified our family of approximating distributions Q, it remains to specify a procedure for choosing the member of this family that best fits P. The most straightforward way to do this is to use the mean field equations specified by Eq. 19.56. These equations were derived by solving for where the derivatives of the variational lower bound are zero. They describe in an abstract manner how to optimize the variational lower bound for any model, simply by taking expectations with respect to Q.

Applying these general equations, we obtain the update rules (again, ignoring bias terms):

ĥ_j^{(1)} = σ( ∑_i v_i W_{i,j}^{(1)} + ∑_{k'} W_{j,k'}^{(2)} ĥ_{k'}^{(2)} ),  ∀j,   (20.33)
ĥ_k^{(2)} = σ( ∑_{j'} W_{j',k}^{(2)} ĥ_{j'}^{(1)} ),  ∀k.   (20.34)

At a fixed point of this system of equations, we have a local maximum of the variational lower bound L(Q). Thus these fixed point update equations define an iterative algorithm where we alternate updates of ĥ^{(1)} (using Eq. 20.33) and updates of ĥ^{(2)} (using Eq. 20.34). On small problems such as MNIST, as few as ten iterations can be sufficient to find an approximate positive phase gradient for learning, and fifty usually suffice to obtain a high quality representation of a single specific example to be used for high-accuracy classification. Extending approximate variational inference to deeper DBMs is straightforward.
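The alternating fixed-point iteration might be sketched as follows (our illustration; initializing ĥ^{(2)} to 0.5 and the default of ten iterations are arbitrary choices, not prescribed by the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_inference(v, W1, W2, n_iter=10):
    """Alternate the fixed-point updates of Eqs. 20.33-20.34.

    W1: shape (n_v, n_h1), W2: shape (n_h1, n_h2); biases omitted.
    Returns the Bernoulli parameters (h1_hat, h2_hat) of Q(h | v).
    """
    h2_hat = np.full(W2.shape[1], 0.5)            # arbitrary starting point
    for _ in range(n_iter):
        h1_hat = sigmoid(v @ W1 + W2 @ h2_hat)    # Eq. 20.33
        h2_hat = sigmoid(h1_hat @ W2)             # Eq. 20.34
    return h1_hat, h2_hat
```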
20.4.3 DBM Parameter Learning
Learning in the DBM must confront both the challenge of an intractable partition function, using the techniques from Chapter 18, and the challenge of an intractable posterior distribution, using the techniques from Chapter 19.

As described in Sec. 20.4.2, variational inference allows the construction of a distribution Q(h | v) that approximates the intractable P(h | v). Learning then proceeds by maximizing L(v, Q, θ), the variational lower bound on the intractable log-likelihood, log P(v; θ).
For a deep Boltzmann machine with tw two o hidden lay layers, ers, L is giv given en by XX X X (1) (2) (2) (1) (1) orθa) = deep Boltzmann withˆ o hidden ˆ layers, is given by L(F Q, vi Wi,j0 hˆmachine + htw 0 j j 0 W j0 ,k0h k0 − log Z (θ ) + H (Q). (20.35) L i j0 j0 k 0 ˆ (Q, θ) = log Z (θ) + (Q). (20.35) v W hˆ + h W hˆ This a deep L expression still contains the log partition function, − log Z( θ). Because H Boltzmann mac machine hine contains restricted Boltzmann machines as comp components, onents, the This expression contains thethe logpartition partitionfunction function, a deep logsampling Z( θ). Because hardness resultsstill for computing and that apply to X X X X Boltzmann mac hine contains restricted Boltzmann machines as comp onents, the restricted Boltzmann mac machines hines also apply to deep Boltzmann machines. This means hardness results for computing partition function and sampling that to that ev evaluating aluating the probabilitythe mass function of a Boltzmann mac machine hine apply requires restricted Boltzmann mac hines also apply to deep Boltzmann machines. This means appro approximate ximate metho methods ds suc such h as annealed imp importance ortance sampling. Likewise, training thatmo evdel aluating theapproximations probability mass function of Boltzmann machine requires the model requires to the gradien gradient t ofa the log partition function. See appro ximate metho ds suc h as annealed imp ortance sampling. Likewise, training Chapter 18 for a general description of these metho methods. ds. DBMs are typically trained the mo del requires approximations to the gradien t of the logtechniques partition function. using sto stocchastic maxim maximum um lik likelihoo elihoo elihood. d. Many of the other describ described edSee in Chapter 18 for a general description of these metho ds. DBMs are t ypically trained Chapter 18 are not applicable. Techniques such as pseudolikelihoo pseudolikelihood d require the usingysto maxim likelihood. 
ability to evaluate the unnormalized probabilities, rather than merely obtain a variational lower bound on them. Contrastive divergence is slow for deep Boltzmann machines because they do not allow efficient sampling of the hidden units given the visible units—instead, contrastive divergence would require burning in a Markov chain every time a new negative phase sample is needed.

The non-variational version of the stochastic maximum likelihood algorithm was discussed earlier, in Sec. 18.2. Variational stochastic maximum likelihood as applied to the DBM is given in Algorithm 20.1. Recall that we describe a simplified variant of the DBM that lacks bias parameters; including them is trivial.
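The tractable part of Eq. 20.35—the two interaction terms plus the entropy of the factorial mean field distribution Q—is cheap to compute; only the −log Z(θ) term requires approximation. A minimal NumPy sketch (the function name and array shapes are our own, not from the text):

```python
import numpy as np

def dbm_lower_bound_tractable(v, h1_hat, h2_hat, W1, W2, eps=1e-12):
    """Tractable terms of the variational bound in Eq. 20.35 for a
    bias-free two-layer DBM: the two interaction terms plus the entropy
    H(Q) of the factorial Bernoulli mean field distribution.  The
    remaining -log Z(theta) term must be estimated separately, e.g.
    with annealed importance sampling."""
    interaction = v @ W1 @ h1_hat + h1_hat @ W2 @ h2_hat
    q = np.concatenate([h1_hat, h2_hat])
    entropy = -np.sum(q * np.log(q + eps) + (1 - q) * np.log(1 - q + eps))
    return interaction + entropy
```

With zero weights and all mean field parameters at 0.5, the bound reduces to the maximum entropy of the hidden units, (n₁ + n₂) log 2.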
20.4.4
Layer-Wise Pretraining
Unfortunately, training a DBM using stochastic maximum likelihood (as described above) from a random initialization usually results in failure. In some cases, the model fails to learn to represent the distribution adequately. In other cases, the DBM may represent the distribution well, but with no higher likelihood than could be obtained with just an RBM. A DBM with very small weights in all but the first layer represents approximately the same distribution as an RBM.

Various techniques that permit joint training have been developed and are described in Sec. 20.4.5. However, the original and most popular method for overcoming the joint training problem of DBMs is greedy layer-wise pretraining. In this method, each layer of the DBM is trained in isolation as an RBM. The first layer is trained to model the input data. Each subsequent RBM is trained to
model samples from the previous RBM's posterior distribution. After all of the
Algorithm 20.1 The variational stochastic maximum likelihood algorithm for training a DBM with two hidden layers.

Set ε, the step size, to a small positive number.
Set k, the number of Gibbs steps, high enough to allow a Markov chain of p(v, h^(1), h^(2); θ + ε∆θ) to burn in, starting from samples from p(v, h^(1), h^(2); θ).
Initialize three matrices, Ṽ, H̃^(1) and H̃^(2), each with m columns set to random values (e.g., from Bernoulli distributions, possibly with marginals matched to the model's marginals).
while not converged (learning loop) do
    Sample a minibatch of m examples from the training data and arrange them as the rows of a design matrix V.
    Initialize matrices Ĥ^(1) and Ĥ^(2), possibly to the model's marginals.
    while not converged (mean field inference loop) do
        Ĥ^(1) ← σ(V W^(1) + Ĥ^(2) W^(2)⊤).
        Ĥ^(2) ← σ(Ĥ^(1) W^(2)).
    end while
    ∆W^(1) ← (1/m) V⊤ Ĥ^(1)
    ∆W^(2) ← (1/m) Ĥ^(1)⊤ Ĥ^(2)
    for l = 1 to k (Gibbs sampling) do
        Gibbs block 1:
        ∀i, j, sample Ṽ_i,j from P(Ṽ_i,j = 1) = σ(W^(1)_j,: (H̃^(1)_i,:)⊤).
        ∀i, j, sample H̃^(2)_i,j from P(H̃^(2)_i,j = 1) = σ(H̃^(1)_i,: W^(2)_:,j).
        Gibbs block 2:
        ∀i, j, sample H̃^(1)_i,j from P(H̃^(1)_i,j = 1) = σ(Ṽ_i,: W^(1)_:,j + H̃^(2)_i,: (W^(2)_j,:)⊤).
    end for
    ∆W^(1) ← ∆W^(1) − (1/m) Ṽ⊤ H̃^(1)
    ∆W^(2) ← ∆W^(2) − (1/m) H̃^(1)⊤ H̃^(2)
    W^(1) ← W^(1) + ε∆W^(1) (this is a cartoon illustration; in practice use a more effective algorithm, such as momentum with a decaying learning rate)
    W^(2) ← W^(2) + ε∆W^(2)
end while
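A rough NumPy transcription of one learning step of Algorithm 20.1 may help make the matrix shapes concrete (function and variable names are ours; fixed iteration counts stand in for the convergence tests, and the plain gradient step is the same cartoon update used in the algorithm):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def variational_sml_step(V, W1, W2, V_t, H1_t, H2_t, rng,
                         step_size=1e-3, k=5, mf_steps=10):
    """One step of variational SML for a bias-free two-layer DBM.
    V: minibatch design matrix (m x n_v).  V_t, H1_t, H2_t: persistent
    negative-phase particles (float arrays), updated in place."""
    m = V.shape[0]
    # Mean field inference loop (a fixed number of sweeps for simplicity).
    H1 = np.full((m, W1.shape[1]), 0.5)
    H2 = np.full((m, W2.shape[1]), 0.5)
    for _ in range(mf_steps):
        H1 = sigmoid(V @ W1 + H2 @ W2.T)
        H2 = sigmoid(H1 @ W2)
    # Positive phase statistics.
    dW1 = V.T @ H1 / m
    dW2 = H1.T @ H2 / m
    # Negative phase: k steps of block Gibbs sampling on the particles.
    for _ in range(k):
        V_t[...] = rng.random(V_t.shape) < sigmoid(H1_t @ W1.T)
        H2_t[...] = rng.random(H2_t.shape) < sigmoid(H1_t @ W2)
        H1_t[...] = rng.random(H1_t.shape) < sigmoid(V_t @ W1 + H2_t @ W2.T)
    dW1 -= V_t.T @ H1_t / m
    dW2 -= H1_t.T @ H2_t / m
    # Cartoon update; in practice use momentum with a decaying rate.
    W1 += step_size * dW1
    W2 += step_size * dW2
```

Note that the two blocks of the Gibbs sweep can each be sampled in parallel: given H̃^(1), the visible and second-layer units are conditionally independent of each other.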
RBMs have been trained in this way, they can be combined to form a DBM. The DBM may then be trained with PCD. Typically, PCD training will make only a small change in the model's parameters and its performance as measured by the log-likelihood it assigns to the data, or its ability to classify inputs. See Fig. 20.4 for an illustration of the training procedure.

This greedy layer-wise training procedure is not just coordinate ascent. It bears some passing resemblance to coordinate ascent because we optimize one subset of the parameters at each step. However, in the case of the greedy layer-wise training procedure, we actually use a different objective function at each step.

Greedy layer-wise pretraining of a DBM differs from greedy layer-wise pretraining of a DBN. The parameters of each individual RBM may be copied into the corresponding DBN directly. In the case of the DBM, the RBM parameters must be modified before inclusion in the DBM. A layer in the middle of the stack of RBMs is trained with only bottom-up input, but after the stack is combined to form the DBM, the layer will have both bottom-up and top-down input. To account for this effect, Salakhutdinov and Hinton (2009a) advocate dividing the weights of all but the top and bottom RBM in half before inserting them into the DBM. Additionally, the bottom RBM must be trained using two "copies" of each visible unit, with the weights tied to be equal between the two copies. This means that the weights are effectively doubled during the upward pass. Similarly, the top RBM should be trained with two copies of the topmost layer.

Obtaining state-of-the-art results with the deep Boltzmann machine requires
a modification of the standard SML algorithm, which is to use a small amount of mean field during the negative phase of the joint PCD training step (Salakhutdinov and Hinton, 2009a). Specifically, the expectation of the energy gradient should be computed with respect to the mean field distribution in which all of the units are independent from each other. The parameters of this mean field distribution should be obtained by running the mean field fixed point equations for just one step. See Goodfellow et al. (2013b) for a comparison of the performance of centered DBMs with and without the use of partial mean field in the negative phase.
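The weight rescaling used when assembling the pretrained RBMs into a DBM can be sketched as follows (a hypothetical helper of our own; it assumes the bottom RBM was trained with two tied copies of its visible units and the top RBM with two copies of its topmost layer, so only the intermediate weights are halved):

```python
import numpy as np

def assemble_dbm_weights(rbm_weights):
    """Combine greedily pretrained RBM weight matrices into DBM weights,
    halving all but the top and bottom ones, following Salakhutdinov and
    Hinton (2009a).  rbm_weights: list of arrays, bottom layer first."""
    n = len(rbm_weights)
    return [W / 2.0 if 0 < i < n - 1 else W.copy()
            for i, W in enumerate(rbm_weights)]
```

The halving compensates for each intermediate layer receiving both bottom-up and top-down input once the layers are joined, where during pretraining it received only one of the two.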
20.4.5
Jointly Training Deep Boltzmann Machines
Classic DBMs require greedy unsupervised pretraining and, to perform classification well, require a separate MLP-based classifier on top of the hidden features they extract. This has some undesirable properties. It is hard to track performance during training because we cannot evaluate properties of the full DBM while training the first RBM. Thus, it is hard to tell how well our hyperparameters
Figure 20.4: The deep Boltzmann machine training procedure used to classify the MNIST dataset (Salakhutdinov and Hinton, 2009a; Srivastava et al., 2014). (a) Train an RBM by using CD to approximately maximize log P(v). (b) Train a second RBM that models h^(1) and target class y by using CD-k to approximately maximize log P(h^(1), y), where h^(1) is drawn from the first RBM's posterior conditioned on the data. Increase k from 1 to 20 during learning. (c) Combine the two RBMs into a DBM. Train it to approximately maximize log P(v, y) using stochastic maximum likelihood with k = 5. (d) Delete y from the model. Define a new set of features h^(1) and h^(2) that are obtained by running mean field inference in the model lacking y. Use these features as input to an MLP whose structure is the same as an additional pass of mean field, with an additional output layer for the estimate of y. Initialize the MLP's weights to be the same as the DBM's weights. Train the MLP to approximately maximize log P(y | v) using stochastic gradient descent and dropout. Figure reprinted from Goodfellow et al. (2013b).
are working until quite late in the training process. Software implementations of DBMs need to have many different components for CD training of individual RBMs, PCD training of the full DBM, and training based on back-propagation through the MLP. Finally, the MLP on top of the Boltzmann machine loses many of the advantages of the Boltzmann machine probabilistic model, such as being able to perform inference when some input values are missing.

There are two main ways to resolve the joint training problem of the deep Boltzmann machine. The first is the centered deep Boltzmann machine (Montavon and Muller, 2012), which reparametrizes the model in order to make the Hessian of the cost function better conditioned at the beginning of the learning process. This yields a model that can be trained without a greedy layer-wise pretraining stage. The resulting model obtains excellent test set log-likelihood and produces high quality samples. Unfortunately, it remains unable to compete with appropriately regularized MLPs as a classifier. The second way to jointly train a deep Boltzmann machine is to use a multi-prediction deep Boltzmann machine (Goodfellow et al., 2013b). This model uses an alternative training criterion that allows the use of the back-propagation algorithm in order to avoid the problems with MCMC estimates of the gradient. Unfortunately, the new criterion does not lead to good likelihood or samples, but, compared to the MCMC approach, it does lead to superior classification performance and the ability to reason well about missing inputs.

The centering trick for the Boltzmann machine is easiest to describe if we return to the general view of a Boltzmann machine as consisting of a set of units x with a weight matrix U and biases b. Recall from Eq. 20.2 that the energy function is given by
\[
E(x) = -x^\top U x - b^\top x. \tag{20.36}
\]

Using different sparsity patterns in the weight matrix U, we can implement structures of Boltzmann machines, such as RBMs, or DBMs with different numbers of layers. This is accomplished by partitioning x into visible and hidden units and zeroing out elements of U for units that do not interact. The centered Boltzmann machine introduces a vector µ that is subtracted from all of the states:

\[
E'(x; U, b) = -(x - \mu)^\top U (x - \mu) - (x - \mu)^\top b. \tag{20.37}
\]

Typically µ is a hyperparameter fixed at the beginning of training. It is usually chosen to make sure that x − µ ≈ 0 when the model is initialized. This reparametrization does not change the set of probability distributions that the model can represent, but it does change the dynamics of stochastic gradient descent applied to the likelihood.
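As a quick check on Eqs. 20.36 and 20.37, here is a direct NumPy transcription (function names are ours); with µ = 0 the centered energy reduces to the standard one:

```python
import numpy as np

def energy(x, U, b):
    """Standard Boltzmann machine energy, Eq. 20.36."""
    return -x @ U @ x - b @ x

def centered_energy(x, U, b, mu):
    """Centered Boltzmann machine energy, Eq. 20.37:
    all states are shifted by the centering vector mu."""
    d = x - mu
    return -d @ U @ d - d @ b
```

In practice, µ would typically be set near the data marginals so that x − µ ≈ 0 for states the model visits at initialization.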
Specifically, in many cases, this reparametrization results in a Hessian matrix that is better conditioned. Melchior et al. (2013) experimentally
confirmed that the conditioning of the Hessian matrix improves, and observed that the centering trick is equivalent to another Boltzmann machine learning technique, the enhanced gradient (Cho et al., 2011). The improved conditioning of the Hessian matrix allows learning to succeed, even in difficult cases like training a deep Boltzmann machine with multiple layers.

The other approach to jointly training deep Boltzmann machines is the multi-prediction deep Boltzmann machine (MP-DBM), which works by viewing the mean field equations as defining a family of recurrent networks for approximately solving every possible inference problem (Goodfellow et al., 2013b). Rather than training the model to maximize the likelihood, the model is trained to make each recurrent network obtain an accurate answer to the corresponding inference problem. The training process is illustrated in Fig. 20.5. It consists of randomly sampling a training example, randomly sampling a subset of inputs to the inference network, and then training the inference network to predict the values of the remaining units.

This general principle of back-propagating through the computational graph for approximate inference has been applied to other models (Stoyanov et al., 2011; Brakel et al., 2013). In these models and in the MP-DBM, the final loss is not the lower bound on the likelihood. Instead, the final loss is typically based on the approximate conditional distribution that the approximate inference network imposes over the missing values. This means that the training of these models is somewhat heuristically motivated. If we inspect the p(v) represented by the Boltzmann machine learned by the MP-DBM, it tends to be somewhat defective, in the sense that Gibbs sampling yields poor samples.
Back-propagation through the inference graph has two main advantages. First, it trains the model as it is really used—with approximate inference. This means that approximate inference, for example to fill in missing inputs, or to perform classification despite the presence of missing inputs, is more accurate in the MP-DBM than in the original DBM. The original DBM does not make an accurate classifier on its own; the best classification results with the original DBM were based on training a separate classifier to use features extracted by the DBM, rather than by using inference in the DBM to compute the distribution over the class labels. Mean field inference in the MP-DBM performs well as a classifier without special modifications. The other advantage of back-propagating through approximate inference is that back-propagation computes the exact gradient of the loss. This is better for optimization than the approximate gradients of SML training, which suffer from both bias and variance. This probably explains why MP-DBMs may be trained jointly while DBMs require greedy layer-wise pretraining.
Figure 20.5: An illustration of the multi-prediction training process for a deep Boltzmann machine. Each row indicates a different example within a minibatch for the same training step. Each column represents a time step within the mean field inference process. For each example, we sample a subset of the data variables to serve as inputs to the inference process. These variables are shaded black to indicate conditioning. We then run the mean field inference process, with arrows indicating which variables influence which other variables in the process. In practical applications, we unroll mean field for several steps. In this illustration, we unroll for only two steps. Dashed arrows indicate how the process could be unrolled for more steps. The data variables that were not used as inputs to the inference process become targets, shaded in gray. We can view the inference process for each example as a recurrent network. We use gradient descent and back-propagation to train these recurrent networks to produce the correct targets given their inputs. This trains the mean field process for the MP-DBM to produce accurate estimates. Figure adapted from Goodfellow et al. (2013b).
The disadvantage of back-propagating through the approximate inference graph is that it does not provide a way to optimize the log-likelihood, but rather a heuristic approximation of the generalized pseudolikelihood.

The MP-DBM inspired the NADE-k (Raiko et al., 2014) extension to the NADE framework, which is described in Sec. 20.10.10.

The MP-DBM has some connections to dropout. Dropout shares the same parameters among many different computational graphs, with the difference between each graph being whether it includes or excludes each unit. The MP-DBM also shares parameters across many computational graphs. In the case of the MP-DBM, the difference between the graphs is whether each input unit is observed or not. When a unit is not observed, the MP-DBM does not delete it entirely as in the case of dropout. Instead, the MP-DBM treats it as a latent variable to be inferred. One could imagine applying dropout to the MP-DBM by additionally removing some units rather than making them latent.
20.5
Boltzmann Machines for Real-Valued Data
While Boltzmann machines were originally developed for use with binary data, many applications such as image and audio modeling seem to require the ability to represent probability distributions over real values. In some cases, it is possible to treat real-valued data in the interval [0, 1] as representing the expectation of a binary variable. For example, Hinton (2000) treats grayscale images in the training set as defining [0, 1] probability values. Each pixel defines the probability of a binary value being 1, and the binary pixels are all sampled independently from each other. This is a common procedure for evaluating binary models on grayscale image datasets. However, it is not a particularly theoretically satisfying approach, and binary images sampled independently in this way have a noisy appearance. In this section, we present Boltzmann machines that define a probability density over real-valued data.
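The binarization scheme above can be sketched in a few lines of NumPy. This is a hedged illustration of the general procedure, not the exact preprocessing of any particular paper; the toy 2x2 "image" is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "grayscale image": intensities in [0, 1] interpreted as
# independent Bernoulli probabilities P(pixel = 1).
gray = np.array([[0.0, 0.5],
                 [0.9, 1.0]])

# Draw one binary image: each pixel is 1 with probability equal to its intensity.
binary_sample = (rng.random(gray.shape) < gray).astype(int)

# Averaging many independent binary samples recovers the grayscale intensities,
# since the grayscale value is the expectation of the binary pixel.
mean_image = np.mean(
    [rng.random(gray.shape) < gray for _ in range(10000)], axis=0)
```

Because the pixels are sampled independently, each individual binary image looks noisy even though the average of many samples matches the original, which is exactly the aesthetic complaint raised above.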
20.5.1
Gaussian-Bernoulli RBMs
Restricted Boltzmann machines may be developed for many exponential family conditional distributions (Welling et al., 2005). Of these, the most common is the RBM with binary hidden units and real-valued visible units, with the conditional distribution over the visible units being a Gaussian distribution whose mean is a function of the hidden units.

There are many ways of parametrizing Gaussian-Bernoulli RBMs. First, we may
choose whether to use a covariance matrix or a precision matrix for the Gaussian distribution. Here we present the precision formulation. The modification to obtain the covariance formulation is straightforward. We wish to have the conditional distribution

p(v | h) = N(v; Wh, β⁻¹).   (20.38)

We can find the terms we need to add to the energy function by expanding the unnormalized log conditional distribution:

log N(v; Wh, β⁻¹) = −(1/2) (v − Wh)⊤ β (v − Wh) + f(β).   (20.39)

Here f encapsulates all the terms that are a function only of the parameters and not the random variables in the model. We can discard f because its only role is to normalize the distribution, and the partition function of whatever energy function we choose will carry out that role.

If we include all of the terms (with their sign flipped) involving v from Eq. 20.39 in our energy function and do not add any other terms involving v, then our energy function will represent the desired conditional p(v | h).

We have some freedom regarding the other conditional distribution, p(h | v). Note that Eq. 20.39 contains a term

(1/2) h⊤ W⊤ β W h.   (20.40)

This term cannot be included in its entirety because it includes h_i h_j terms. These correspond to edges between the hidden units. If we included these terms, we would have a linear factor model instead of a restricted Boltzmann machine. When designing our Boltzmann machine, we simply omit these h_i h_j cross terms. Omitting them does not change the conditional p(v | h), so Eq. 20.39 is still respected. However, we still have a choice about whether to include the terms involving only a single h_i. If we assume a diagonal precision matrix, we find that for each hidden unit h_i we have a term

(1/2) h_i Σ_j β_j W²_{j,i}.   (20.41)

In the above, we used the fact that h_i² = h_i because h_i ∈ {0, 1}. If we include this term (with its sign flipped) in the energy function, then it will naturally bias h_i to be turned off when the weights for that unit are large and connected to visible units with high precision. The choice of whether or not to include this bias term does not affect the family of distributions the model can represent (assuming that
we include bias parameters for the hidden units) but it does affect the learning dynamics of the model. Including the term may help the hidden unit activations remain reasonable even when the weights rapidly increase in magnitude.

One way to define the energy function on a Gaussian-Bernoulli RBM is thus

E(v, h) = (1/2) v⊤(β ⊙ v) − (v ⊙ β)⊤ W h − b⊤h,   (20.42)

but we may also add extra terms or parametrize the energy in terms of the variance rather than the precision if we choose.

In this derivation, we have not included a bias term on the visible units, but one could easily be added. One final source of variability in the parametrization of a Gaussian-Bernoulli RBM is the choice of how to treat the precision matrix. It may either be fixed to a constant (perhaps estimated based on the marginal precision of the data) or learned. It may also be a scalar times the identity matrix, or it may be a diagonal matrix. Typically we do not allow the precision matrix to be non-diagonal in this context, because some operations would then require inverting the matrix. In the sections ahead, we will see that other forms of Boltzmann machines permit modeling the covariance structure, using various techniques to avoid inverting the precision matrix.
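As a sanity check on the energy of Eq. 20.42, the following sketch (with an assumed diagonal precision stored as a vector `beta`, and arbitrary toy dimensions) verifies numerically that the energy's dependence on v matches the Gaussian log-density of Eq. 20.38 up to a v-independent constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 4, 3

W = rng.normal(size=(n_vis, n_hid))    # weights
b = rng.normal(size=n_hid)             # hidden biases
beta = np.array([1.0, 2.0, 0.5, 4.0])  # diagonal precision on the visible units

def energy(v, h):
    # Eq. 20.42: E(v, h) = 1/2 v^T (beta * v) - (v * beta)^T W h - b^T h
    return 0.5 * v @ (beta * v) - (v * beta) @ W @ h - b @ h

def log_gaussian(v, h):
    # log N(v; Wh, beta^{-1}) up to the v-independent normalizer f(beta)
    mean = W @ h
    return -0.5 * (v - mean) @ (beta * (v - mean))

h = np.array([1.0, 0.0, 1.0])
v1, v2 = rng.normal(size=n_vis), rng.normal(size=n_vis)

# Energy differences in v must match the Gaussian log-density differences,
# confirming that this energy yields p(v | h) = N(v; Wh, beta^{-1}).
lhs = energy(v1, h) - energy(v2, h)
rhs = -(log_gaussian(v1, h) - log_gaussian(v2, h))
assert np.isclose(lhs, rhs)
```

The h-only terms (b⊤h and the discarded f) cancel in the difference, which is why comparing two values of v suffices.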
20.5.2
Undirected Models of Conditional Covariance
While the Gaussian RBM has been the canonical energy model for real-valued data, Ranzato et al. (2010a) argue that the Gaussian RBM inductive bias is not well suited to the statistical variations present in some types of real-valued data, especially natural images. The problem is that much of the information content present in natural images is embedded in the covariance between pixels rather than in the raw pixel values. In other words, it is the relationships between pixels and not their absolute values where most of the useful information in images resides. Since the Gaussian RBM only models the conditional mean of the input given the hidden units, it cannot capture conditional covariance information. In response to these criticisms, alternative models have been proposed that attempt to better account for the covariance of real-valued data. These models include the mean and covariance RBM (mcRBM¹), the mean-product of t-distribution (mPoT) model and the spike and slab RBM (ssRBM).

¹ The term “mcRBM” is pronounced by saying the name of the letters M-C-R-B-M; the “mc” is not pronounced like the “Mc” in “McDonald’s.”
Mean and Covariance RBM  The mcRBM uses its hidden units to independently encode the conditional mean and covariance of all observed units. The mcRBM hidden layer is divided into two groups of units: mean units and covariance units. The group that models the conditional mean is simply a Gaussian RBM. The other half is a covariance RBM (Ranzato et al., 2010a), also called a cRBM, whose components model the conditional covariance structure, as described below.

Specifically, with binary mean units h^(m) and binary covariance units h^(c), the mcRBM model is defined as the combination of two energy functions:

E_mc(x, h^(m), h^(c)) = E_m(x, h^(m)) + E_c(x, h^(c)),   (20.43)

where E_m is the standard Gaussian-Bernoulli RBM energy function:²

E_m(x, h^(m)) = (1/2) x⊤x − Σ_j x⊤W_{:,j} h_j^(m) − Σ_j b_j^(m) h_j^(m),   (20.44)

and E_c is the cRBM energy function that models the conditional covariance information:

E_c(x, h^(c)) = (1/2) Σ_j h_j^(c) (x⊤r^(j))² − Σ_j b_j^(c) h_j^(c).   (20.45)

The parameter r^(j) corresponds to the covariance weight vector associated with h_j^(c), and b^(c) is a vector of covariance offsets. The combined energy function defines a joint distribution:

p_mc(x, h^(m), h^(c)) = (1/Z) exp{ −E_mc(x, h^(m), h^(c)) },   (20.46)

and a corresponding conditional distribution over the observations given h^(m) and h^(c) as a multivariate Gaussian distribution:

p_mc(x | h^(m), h^(c)) = N( C^mc_{x|h} Σ_j W_{:,j} h_j^(m),  C^mc_{x|h} ).   (20.47)

Note that the covariance matrix C^mc_{x|h} = ( Σ_j h_j^(c) r^(j) r^(j)⊤ + I )^{−1} is non-diagonal and that W is the weight matrix associated with the Gaussian RBM modeling the conditional means.

² This version of the Gaussian-Bernoulli RBM energy function assumes the image data has zero mean, per pixel. Pixel offsets can easily be added to the model to account for nonzero pixel means.
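The conditional of Eq. 20.47 can be sketched numerically as follows. The dimensions and the particular unit activations are arbitrary illustrative assumptions, and the columns of `R` play the role of the vectors r^(j):

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_mean, n_cov = 3, 2, 4

W = rng.normal(size=(n_vis, n_mean))   # weights of the Gaussian (mean) RBM
R = rng.normal(size=(n_vis, n_cov))    # column j is the covariance weight vector r^(j)

h_m = np.array([1.0, 0.0])             # binary mean units h^(m)
h_c = np.array([1.0, 1.0, 0.0, 1.0])   # binary covariance units h^(c)

# Conditional precision sum_j h_j^(c) r^(j) r^(j)^T + I; its inverse is the
# (non-diagonal) conditional covariance C of Eq. 20.47.
precision = (R * h_c) @ R.T + np.eye(n_vis)
C = np.linalg.inv(precision)

mean = C @ (W @ h_m)                   # conditional mean C W h^(m)
x = rng.multivariate_normal(mean, C)   # one sample from p(x | h^(m), h^(c))
```

Note that each covariance unit that is switched on adds a rank-one term to the precision, so inverting it (as done explicitly here) is exactly the per-iteration cost that the training discussion below is concerned with.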
It is difficult to train the mcRBM via contrastive divergence or persistent contrastive divergence because of its non-diagonal conditional covariance structure. CD and PCD require sampling from the joint distribution of x, h^(m), h^(c), which, in a standard RBM, is accomplished by Gibbs sampling over the conditionals. However, in the mcRBM, sampling from p_mc(x | h^(m), h^(c)) requires computing C^mc_{x|h}, that is, inverting the conditional precision matrix, at every iteration of learning. This can be an impractical computational burden for larger observations. Ranzato and Hinton (2010) avoid direct sampling from the conditional p_mc(x | h^(m), h^(c)) by sampling directly from the marginal p(x) using Hamiltonian (hybrid) Monte Carlo (Neal, 1993) on the mcRBM free energy.

Mean-Product of Student's t-distributions  The mean-product of Student's t-distribution (mPoT) model (Ranzato et al., 2010b) extends the PoT model (Welling et al., 2003a) in a manner similar to how the mcRBM extends the cRBM.
This t-distribution oT) model (Ranzato et al., 2010b ) extends PoT moofdelGaussian (Welling is ac achieved hieved by(mP including nonzero Gaussian means by thethe addition et al., 2003a ) in aunits. manner to howthe thePoT mcRBM extends the cRBM. RBM-lik RBM-like e hidden Lik Likeesimilar the mcRBM, conditional distribution overThis the is ac hieved by including nonzero Gaussian means b y the addition of Gaussian observ observation ation is a multiv multivariate ariate Gaussian (with non-diagonal cov covariance) ariance) distribution; RBM-lik e hidden units. Lik e the mcRBM, the PoT conditional distribution oover ver the the ho how wev ever, er, unlik unlikee the mcRBM, the complemen complementary tary conditional distribution observation is a multiv ariate Gaussian (with non-diagonal covariance) distribution; hidden variables is given by conditionally indep distributions. The independen enden endentt Gamma ho w ev er, unlik e the mcRBM, the complemen tary conditional distribution o ver the Gamma distribution G( k, θ) is a probabilit probability y distribution ov over er positive real num umb bers, hidden variables by conditionally enden t Gamma distributions. with mean not necessary to ha have veindep a more detailed understanding of The the kθ. Itisisgiven ( k , θ Gamma distribution ) is a probabilit y distribution ov er p ositive real n um b ers, Gamma distribution to understand the basic ideas underlying the mP mPoT oT mo model. del. with mean kθ. It is not G necessary to have a more detailed understanding of the The mP mPoT oT energy function is: Gamma distribution to understand the basic ideas underlying the mPoT model. 
The mPoT energy function is:

E_mPoT(x, h^(m), h^(c))   (20.48)
   = E_m(x, h^(m)) + Σ_j ( h_j^(c) ( 1 + (1/2) (r^(j)⊤x)² ) + (1 − γ_j) log h_j^(c) ),   (20.49)

where r^(j) is the covariance weight vector associated with unit h_j^(c) and E_m(x, h^(m)) is as defined in Eq. 20.44.

Just as with the mcRBM, the mPoT model energy function specifies a multivariate Gaussian, with a conditional distribution over x that has non-diagonal covariance. The covariance units h^(c) are conditionally Gamma-distributed:

p_mPoT(h_j^(c) | x) = G( γ_j, 1 + (1/2) (r^(j)⊤x)² ).   (20.50)

Learning in the mPoT model is, as in the mcRBM, complicated by the inability to sample from the non-diagonal Gaussian conditional p_mPoT(x | h^(m), h^(c)), so Ranzato et al. (2010b) also advocate direct sampling of p(x) via Hamiltonian (hybrid) Monte Carlo.
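The sketch below checks one consistent reading of Eqs. 20.48, 20.49 and 20.50: dropping the h^(c)-dependent energy terms into exp(−E) yields an unnormalized density h^{γ−1} exp(−h z), a Gamma distribution with shape γ_j and rate z_j = 1 + (1/2)(r^(j)⊤x)², that is, mean γ_j/z_j under the G(k, θ) convention with θ = 1/z_j. All tensor shapes and parameter values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_cov = 3, 2

R = rng.normal(size=(n_vis, n_cov))  # column j is the covariance weight vector r^(j)
gamma = np.array([2.0, 3.0])         # Gamma shape parameters gamma_j
x = rng.normal(size=n_vis)

# Rate parameters z_j = 1 + 1/2 (r^(j)^T x)^2, as in Eq. 20.50.
z = 1.0 + 0.5 * (R.T @ x) ** 2

def unnorm_log_p(h, j):
    # Negative of unit j's h^(c) energy terms in Eq. 20.49:
    #   h * z_j + (1 - gamma_j) * log h
    return -(h * z[j] + (1.0 - gamma[j]) * np.log(h))

def gamma_log_pdf(h, shape, rate):
    # Log of the unnormalized Gamma density h^{shape-1} exp(-rate * h)
    return (shape - 1.0) * np.log(h) - rate * h

# The two agree up to an additive constant in h, so the conditional is Gamma.
h1, h2 = 0.7, 2.5
for j in range(n_cov):
    diff_energy = unnorm_log_p(h1, j) - unnorm_log_p(h2, j)
    diff_gamma = gamma_log_pdf(h1, gamma[j], z[j]) - gamma_log_pdf(h2, gamma[j], z[j])
    assert np.isclose(diff_energy, diff_gamma)

# Sample the covariance units (NumPy's gamma takes shape and scale = 1/rate).
h_c = rng.gamma(shape=gamma, scale=1.0 / z)
```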
Spike and Slab Restricted Boltzmann Machines  Spike and slab restricted Boltzmann machines (Courville et al., 2011), or ssRBMs, provide another means of modeling the covariance structure of real-valued data. Compared to mcRBMs, ssRBMs have the advantage of requiring neither matrix inversion nor Hamiltonian Monte Carlo methods. As a model of natural images, the ssRBM is interesting in that, like the mcRBM and the mPoT model, its binary hidden units encode the conditional covariance across pixels through the use of auxiliary real-valued variables.

The spike and slab RBM has two sets of hidden units: binary spike units h, and real-valued slab units s. The mean of the visible units conditioned on the hidden units is given by (h ⊙ s)W⊤. In other words, each column W_{:,i} defines a component that can appear in the input when h_i = 1. The corresponding spike variable h_i determines whether that component is present at all. The corresponding slab variable s_i determines the intensity of that component, if it is present. When a spike variable is active, the corresponding slab variable adds variance to the input along the axis defined by W_{:,i}. This allows us to model the covariance of the inputs. Fortunately, contrastive divergence and persistent contrastive divergence with Gibbs sampling are still applicable. There is no need to invert any matrix.

Formally, the ssRBM model is defined via its energy function:

E_ss(x, s, h) = −Σ_i x⊤W_{:,i} s_i h_i + (1/2) x⊤( Λ + Σ_i Φ_i h_i ) x   (20.51)
       + (1/2) Σ_i α_i s_i² − Σ_i α_i µ_i s_i h_i − Σ_i b_i h_i + (1/2) Σ_i α_i µ_i² h_i,   (20.52)

where b_i is the offset of the spike h_i, and Λ is a diagonal precision matrix on the observations x. The parameter α_i > 0 is a scalar precision parameter for the real-valued slab variable s_i. The parameter Φ_i is a non-negative diagonal matrix that defines an h-modulated quadratic penalty on x. Each µ_i is a mean parameter for the slab variable s_i.

With the joint distribution defined via the energy function, it is relatively straightforward to derive the ssRBM conditional distributions. For example, by marginalizing out the slab variables s, the conditional distribution over the observations given the binary spike variables h is given by:

p_ss(x | h) = (1/P(h)) ∫ (1/Z) exp{ −E_ss(x, s, h) } ds   (20.53)
= N( C^ss_{x|h} Σ_i W_{:,i} µ_i h_i ,  C^ss_{x|h} ),   (20.54)

where C^ss_{x|h} = ( Λ + Σ_i Φ_i h_i − Σ_i α_i^{−1} h_i W_{:,i} W_{:,i}⊤ )^{−1}. The last equality holds only if the covariance matrix C^ss_{x|h} is positive definite.

Gating by the spike variables means that the true marginal distribution over h ⊙ s is sparse. This is different from sparse coding, where samples from the model "almost never" (in the measure theoretic sense) contain zeros in the code, and MAP inference is required to impose sparsity.

Comparing the ssRBM to the mcRBM and the mPoT models, the ssRBM parametrizes the conditional covariance of the observation in a significantly different way. The mcRBM and mPoT both model the covariance structure of the observation as ( Σ_j h_j^(c) r^(j) r^(j)⊤ + I )^{−1}, using the activation of the hidden units h_j > 0 to enforce constraints on the conditional covariance in the direction r^(j).
In contrast, the ssRBM specifies the conditional covariance of the observations using the hidden spike activations h_i = 1 to pinch the precision matrix along the direction specified by the corresponding weight vector. The ssRBM conditional covariance is very similar to that given by a different model: the product of probabilistic principal components analysis (PoPPCA) (Williams and Agakov, 2002). In the overcomplete setting, sparse activations with the ssRBM parametrization permit significant variance (above the nominal variance given by Λ^{−1}) only in the directions of the sparsely activated h_i. In the mcRBM or mPoT models, an overcomplete representation would mean that capturing variation in a particular direction in the observation space requires removing potentially all constraints with positive projection in that direction.
This would suggest that these models are less well suited to the overcomplete setting.

The primary disadvantage of the spike and slab restricted Boltzmann machine is that some settings of the parameters can correspond to a covariance matrix that is not positive definite. Such a covariance matrix places more unnormalized probability on values that are farther from the mean, causing the integral over all possible outcomes to diverge. Generally this issue can be avoided with simple heuristic tricks. There is not yet any theoretically satisfying solution. Using constrained optimization to explicitly avoid the regions where the probability is undefined is difficult to do without being overly conservative and also preventing the model from accessing high-performing regions of parameter space.

Qualitatively, convolutional variants of the ssRBM produce excellent samples of natural images. Some examples are shown in Fig. 16.1.
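The positive-definiteness failure described above is easy to exhibit numerically. In this hypothetical setting (Φ = 0, a handful of units, hand-picked weights), the inverse covariance of Eq. 20.54 stays positive definite for small weights but acquires a negative eigenvalue once the weights grow:

```python
import numpy as np

n_vis = 3
Lam = np.eye(n_vis)           # Lambda: diagonal precision on x
alpha = np.array([1.0, 1.0])  # precisions of the slab variables s_i
h = np.array([1.0, 1.0])      # both spike units active

def conditional_precision(W):
    # Inverse of C^ss_{x|h} (Eq. 20.54), taking Phi = 0 for simplicity:
    #   Lambda - sum_i alpha_i^{-1} h_i W_{:,i} W_{:,i}^T
    P = Lam.copy()
    for i in range(len(h)):
        P -= (h[i] / alpha[i]) * np.outer(W[:, i], W[:, i])
    return P

# Small weights: the precision stays positive definite, so p(x | h) is a
# proper Gaussian.
W_small = 0.1 * np.ones((n_vis, 2))
assert np.all(np.linalg.eigvalsh(conditional_precision(W_small)) > 0)

# Large weights: the same formula yields a "precision" with a negative
# eigenvalue, the pathological parameter setting described above.
W_large = 2.0 * np.ones((n_vis, 2))
assert np.min(np.linalg.eigvalsh(conditional_precision(W_large))) < 0
```

Along the offending eigendirection the unnormalized density grows with distance from the mean, so the distribution cannot be normalized; this is the divergence the heuristic tricks must avoid.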
Some examples are sho wn in Fig. 16.1. 684 C ss x|h
P
=
C
X
> −1 α−1 . i hi W :,iW:,i iN
P
CHAPTER 20. DEEP GENERATIVE MODELS
The ssRBM allows for several extensions. Including higher-order interactions and average-pooling of the slab variables (Courville et al., 2014) enables the model to learn excellent features for a classifier when labeled data is scarce. Adding a term to the energy function that prevents the partition function from becoming undefined results in a sparse coding model, spike and slab sparse coding (Goodfellow et al., 2013d), also known as S3C.
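The conditional covariance formula above and its positive-definiteness requirement can be illustrated numerically. The sketch below is a hypothetical NumPy illustration with invented parameter values (it is not code from the cited papers), and `ssrbm_conditional_covariance` is a made-up helper name.

```python
import numpy as np

def ssrbm_conditional_covariance(Lam, W, alpha, h):
    """Conditional covariance of x given spike activations h:
    C = (Lam - sum_i h_i * alpha_i^{-1} * W[:, i] W[:, i]^T)^{-1}.
    Raises if the precision matrix is not positive definite, which is
    exactly the normalization failure discussed in the text."""
    precision = Lam - (W * (h / alpha)) @ W.T
    if np.any(np.linalg.eigvalsh(precision) <= 0):
        raise ValueError("precision matrix is not positive definite")
    return np.linalg.inv(precision)

rng = np.random.default_rng(0)
d, N = 4, 3
Lam = np.eye(d)                          # nominal precision, so Lambda^{-1} = I
W = 0.15 * rng.standard_normal((d, N))   # small weights keep the model valid
alpha = np.ones(N)

C_off = ssrbm_conditional_covariance(Lam, W, alpha, np.zeros(N))
C_on = ssrbm_conditional_covariance(Lam, W, alpha, np.ones(N))
# Activating spikes removes precision along the weight directions, so the
# variance can only grow relative to the nominal Lambda^{-1}:
assert np.all(np.diag(C_on) >= np.diag(C_off) - 1e-9)
```

With large enough weights, the subtraction makes the precision matrix lose positive definiteness, and the helper raises instead of returning a divergent "covariance", mirroring the undefined-partition-function problem described above.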
20.6
Convolutional Boltzmann Machines
As seen in Chapter 9, extremely high dimensional inputs such as images place great strain on the computation, memory and statistical requirements of machine learning models. Replacing matrix multiplication by discrete convolution with a small kernel is the standard way of solving these problems for inputs that have translation invariant spatial or temporal structure. Desjardins and Bengio (2008) showed that this approach works well when applied to RBMs.

Deep convolutional networks usually require a pooling operation so that the spatial size of each successive layer decreases. Feedforward convolutional networks often use a pooling function such as the maximum of the elements to be pooled. It is unclear how to generalize this to the setting of energy-based models. We could introduce a binary pooling unit p over n binary detector units d and enforce p = max_i d_i by setting the energy function to be ∞ whenever that constraint is violated. This does not scale well, though, as it requires evaluating 2^n different energy configurations to compute the normalization constant. For a small 3 × 3 pooling region this requires 2^9 = 512 energy function evaluations per pooling unit!

Lee et al. (2009) developed a solution to this problem called probabilistic max pooling (not to be confused with "stochastic pooling," which is a technique for implicitly constructing ensembles of convolutional feedforward networks). The strategy behind probabilistic max pooling is to constrain the detector units so that at most one may be active at a time. This means there are only n + 1 total states (one state for each of the n detector units being on, and an additional state corresponding to all of the detector units being off). The pooling unit is on if and only if one of the detector units is on. The state with all units off is assigned energy zero. We can think of this as describing a model with a single variable that has n + 1 states, or equivalently as a model with n + 1 variables that assigns energy ∞ to all but n + 1 joint assignments of those variables.

While efficient, probabilistic max pooling does force the detector units to be mutually exclusive, which may be a useful regularizing constraint in some contexts or a harmful limit on model capacity in other contexts. It also does not support
overlapping pooling regions. Overlapping pooling regions are usually required to obtain the best performance from feedforward convolutional networks, so this constraint probably greatly reduces the performance of convolutional Boltzmann machines.

Lee et al. (2009) demonstrated that probabilistic max pooling could be used to build convolutional deep Boltzmann machines.³ This model is able to perform operations such as filling in missing portions of its input. While intellectually appealing, this model is challenging to make work in practice, and it usually does not perform as well as a classifier as traditional convolutional networks trained with supervised learning.

Many convolutional models work equally well with inputs of many different spatial sizes. For Boltzmann machines, it is difficult to change the input size for a variety of reasons. The partition function changes as the size of the input changes. Moreover, many convolutional networks achieve size invariance by scaling up the size of their pooling regions proportionally to the size of the input, but scaling up Boltzmann machine pooling regions is awkward. Traditional convolutional neural networks can use a fixed number of pooling units and dynamically increase the size of their pooling regions in order to obtain a fixed-size representation of a variable-sized input. For Boltzmann machines, large pooling regions become too expensive for the naive approach. The approach of Lee et al. (2009), making each of the detector units in the same pooling region mutually exclusive, solves the computational problems but still does not allow variable-size pooling regions. For example, suppose we learn a model with 2 × 2 probabilistic max pooling over detector units that learn edge detectors.
This enforces the constraint that only one of these edges may appear in each 2 × 2 region. If we then increase the size of the input image by 50% in each direction, we would expect the number of edges to increase correspondingly. Instead, if we increase the size of the pooling regions by 50% in each direction to 3 × 3, then the mutual exclusivity constraint now specifies that each of these edges may appear only once in a 3 × 3 region. As we grow a model's input image in this way, the model generates edges with less density. Of course, these issues arise only when the model must use variable amounts of pooling in order to emit a fixed-size output vector. Models that use probabilistic max pooling may still accept variable-sized input images, so long as the output of the model is a feature map that can scale in size proportionally to the input image.

Pixels at the boundary of the image also pose some difficulty, which is exacerbated by the fact that connections in a Boltzmann machine are symmetric. If we do not implicitly zero-pad the input, then there are fewer hidden units than visible units, and the visible units at the boundary of the image are not modeled well, because they lie in the receptive field of fewer hidden units. However, if we do implicitly zero-pad the input, then the hidden units at the boundary are driven by fewer input pixels and may fail to activate when needed.

³ The publication describes the model as a "deep belief network," but because it can be described as a purely undirected model with tractable layer-wise mean field fixed point updates, it best fits the definition of a deep Boltzmann machine.
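The state-counting argument behind probabilistic max pooling (2^n joint detector configurations for the naive energy-based pooling versus only n + 1 allowed states) can be sketched as follows. This is an illustrative toy implementation with made-up detector scores, not code from Lee et al. (2009).

```python
import itertools
import math

def naive_num_states(n):
    """Naive max pooling must evaluate every joint configuration of the
    n binary detector units: 2**n energy evaluations."""
    return sum(1 for _ in itertools.product([0, 1], repeat=n))

def prob_max_pool(scores):
    """Probabilistic max pooling: at most one detector unit may be on.
    State 0 = all detectors off (energy 0); state i = only unit i on,
    with energy -scores[i-1]. Returns state probabilities and P(p = 1)."""
    unnorm = [1.0] + [math.exp(s) for s in scores]  # exp(-energy) per state
    Z = sum(unnorm)
    probs = [u / Z for u in unnorm]
    p_on = 1.0 - probs[0]   # pooling unit p = max_i d_i is on unless all off
    return probs, p_on

scores = [0.5, 1.0, -0.2]               # a hypothetical 3-unit pooling region
probs, p_on = prob_max_pool(scores)
assert abs(sum(probs) - 1.0) < 1e-12
assert naive_num_states(9) == 512       # 3x3 region: 2^9 naive evaluations
assert len(prob_max_pool([0.0] * 9)[0]) == 10  # versus only 9 + 1 states
```

The mutual exclusivity constraint is what collapses the normalization from an exponential to a linear number of terms.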
20.7
Boltzmann Machines for Structured or Sequential Outputs

In the structured output scenario, we wish to train a model that can map from some input x to some output y, and the different entries of y are related to each other and must obey some constraints. For example, in the speech synthesis task, y is a waveform, and the entire waveform must sound like a coherent utterance.

A natural way to represent the relationships between the entries in y is to use a probability distribution p(y | x). Boltzmann machines, extended to model conditional distributions, can supply this probabilistic model.

The same tool of conditional modeling with a Boltzmann machine can be used not just for structured output tasks, but also for sequence modeling. In the latter case, rather than mapping an input x to an output y, the model must estimate a probability distribution over a sequence of variables, p(x^(1), . . . , x^(τ)). Conditional Boltzmann machines can represent factors of the form p(x^(t) | x^(1), . . . , x^(t−1)) in order to accomplish this task.

An important sequence modeling task for the video game and film industry is modeling sequences of joint angles of skeletons used to render 3-D characters. These sequences are often collected using motion capture systems to record the movements of actors. A probabilistic model of a character's movement allows the generation of new, previously unseen, but realistic animations. To solve this sequence modeling task, Taylor et al. (2007) introduced a conditional RBM modeling p(x^(t) | x^(t−1), . . . , x^(t−m)) for small m. The model is an RBM over x^(t) whose bias parameters are a linear function of the preceding m values of x. When we condition on different values of x^(t−1) and earlier variables, we get a new RBM over x. The weights in the RBM over x never change, but by conditioning on different past values, we can change the probability of different hidden units in the RBM being active. By activating and deactivating different subsets of hidden units, we can make large changes to the probability distribution induced on x. Other variants of conditional RBM (Mnih et al., 2011) and other variants of sequence
modeling using conditional RBMs are possible (Taylor and Hinton, 2009; Sutskever et al., 2009; Boulanger-Lewandowski et al., 2012).

Another sequence modeling task is to model the distribution over sequences of musical notes used to compose songs. Boulanger-Lewandowski et al. (2012) introduced the RNN-RBM sequence model and applied it to this task. The RNN-RBM is a generative model of a sequence of frames x^(t), consisting of an RNN that emits the RBM parameters for each time step. Unlike the model described above, the RNN emits all of the parameters of the RBM, including the weights. To train the model, we need to be able to back-propagate the gradient of the loss function through the RNN. The loss function is not applied directly to the RNN outputs. Instead, it is applied to the RBM. This means that we must approximately differentiate the loss with respect to the RBM parameters using contrastive divergence or a related algorithm. This approximate gradient may then be back-propagated through the RNN using the usual back-propagation through time algorithm.
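The conditional RBM construction described above, fixed weights with bias parameters that are a linear function of the m preceding frames, can be sketched as follows. All shapes, names, and parameter values here are invented for illustration; this is not Taylor et al.'s implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_hidden, m = 5, 4, 2            # frame size, hidden units, history length

W = 0.1 * rng.standard_normal((d, n_hidden))  # fixed RBM weights: never change
b_static = np.zeros(d)                         # static visible bias
c_static = np.zeros(n_hidden)                  # static hidden bias
A = 0.1 * rng.standard_normal((d, m * d))          # history -> visible bias
B = 0.1 * rng.standard_normal((n_hidden, m * d))   # history -> hidden bias

def conditional_biases(history):
    """Biases of the RBM over x^(t), given the m previous frames
    (a list [x^(t-m), ..., x^(t-1)])."""
    hvec = np.concatenate(history)
    return b_static + A @ hvec, c_static + B @ hvec

history = [rng.standard_normal(d) for _ in range(m)]
b, c = conditional_biases(history)
# Conditioning on a different past changes the biases (and hence which
# hidden units tend to activate), while the weights W stay fixed.
assert b.shape == (d,) and c.shape == (n_hidden,)
```

In the RNN-RBM, by contrast, the recurrent network would also emit W itself at each time step, not just the biases.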
20.8
Other Boltzmann Machines
Many other variants of Boltzmann machines are possible.

Boltzmann machines may be extended with different training criteria. We have focused on Boltzmann machines trained to approximately maximize the generative criterion log p(v). It is also possible to train discriminative RBMs that aim to maximize log p(y | v) instead (Larochelle and Bengio, 2008). This approach often performs best when using a linear combination of both the generative and the discriminative criteria. Unfortunately, RBMs do not seem to be as powerful supervised learners as MLPs, at least using existing methodology.

Most Boltzmann machines used in practice have only second-order interactions in their energy functions, meaning that their energy functions are the sum of many terms, and each individual term includes only the product between two random variables. An example of such a term is v_i W_{i,j} h_j. It is also possible to train higher-order Boltzmann machines (Sejnowski, 1987) whose energy function terms involve the products between many variables. Three-way interactions between a hidden unit and two different images can model spatial transformations from one frame of video to the next (Memisevic and Hinton, 2007, 2010). Multiplication by a one-hot class variable can change the relationship between visible and hidden units depending on which class is present (Nair and Hinton, 2009). One recent example of the use of higher-order interactions is a Boltzmann machine with two groups of hidden units, with one group of hidden units that interact with both the visible
units v and the class label y, and another group of hidden units that interact only with the v input values (Luo et al., 2011). This can be interpreted as encouraging some hidden units to learn to model the input using features that are relevant to the class, but also to learn extra hidden units that explain nuisance details that are necessary for the samples of v to be realistic but do not determine the class of the example. Another use of higher-order interactions is to gate some features. Sohn et al. (2013) introduced a Boltzmann machine with third-order interactions and binary mask variables associated with each visible unit. When these masking variables are set to zero, they remove the influence of a visible unit on the hidden units. This allows visible units that are not relevant to the classification problem to be removed from the inference pathway that estimates the class.

More generally, the Boltzmann machine framework is a rich space of models
permitting many more model structures than have been explored so far. Developing a new form of Boltzmann machine requires somewhat more care and creativity than developing a new neural network layer, because it is often difficult to find an energy function that maintains tractability of all of the different conditional distributions needed to use the Boltzmann machine. But despite this required effort, the field remains open to innovation.
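The contrast between the second-order terms v_i W_{i,j} h_j and the higher-order interactions surveyed above can be made concrete with a small sketch. The tensors below are random placeholders, not parameters of any published model.

```python
import numpy as np

rng = np.random.default_rng(2)
n_v, n_h, n_y = 3, 4, 2
v = rng.integers(0, 2, n_v).astype(float)   # visible units
h = rng.integers(0, 2, n_h).astype(float)   # hidden units
y = np.eye(n_y)[0]                          # one-hot class variable

W = rng.standard_normal((n_v, n_h))         # second-order weights
T = rng.standard_normal((n_v, n_h, n_y))    # third-order weight tensor

E_second = -v @ W @ h                             # sum_ij v_i W_ij h_j
E_third = -np.einsum('i,j,k,ijk->', v, h, y, T)   # sum_ijk v_i h_j y_k T_ijk

# With a one-hot y, the three-way term reduces to a second-order term
# whose weight matrix T[:, :, k] is selected by the class, i.e. the class
# changes the relationship between visible and hidden units.
k = int(np.argmax(y))
assert np.isclose(E_third, -v @ T[:, :, k] @ h)
```

The same algebra explains the gating interpretation: one variable selects which effective pairwise weights the other two variables interact through.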
20.9
Back-Propagation through Random Operations
Traditional neural networks implement a deterministic transformation of some input variables x. When developing generative models, we often wish to extend neural networks to implement stochastic transformations of x. One straightforward way to do this is to augment the neural network with extra inputs z that are sampled from some simple probability distribution, such as a uniform or Gaussian distribution. The neural network can then continue to perform deterministic computation internally, but the function f(x, z) will appear stochastic to an observer who does not have access to z. Provided that f is continuous and differentiable, we can then compute the gradients necessary for training using back-propagation as usual.

As an example, let us consider the operation consisting of drawing samples y from a Gaussian distribution with mean µ and variance σ²:

y ∼ N(µ, σ²).    (20.55)

Because an individual sample of y is not produced by a function, but rather by a sampling process whose output changes every time we query it, it may seem counterintuitive to take the derivatives of y with respect to the parameters of
its distribution, µ and σ2 . Ho How wev ever, er, w wee can rewrite the sampling pro process cess as transforming an underlying random value z ∼ N (z ; 0, 1) to obtain a sample from µ and σ . However, we can rewrite the sampling process as its distribution, the desired distribution: transforming an underlying randomy v=alue from µ +zσz (z ; 0, 1) to obtain a sample (20.56) the desired distribution: ∼N y = through µ + σz the sampling op We are no now w able to bac back-propagate k-propagate operation, eration, by (20.56) regarding it as a deterministic op operation eration with an extra input z. Crucially Crucially,, the extra input W e are no w able to bac k-propagate through the sampling eration, regardis a random variable whose distribution is not a function of op any of theby variables ing it as a deterministic op eration with an extra input z . Crucially , the extra input whose deriv derivatives atives we wan wantt to calculate. The result tells us how an infinitesimal a random is we notcould a function of the op variables cishange in µ orvariable changedistribution the output if rep repeat eat of theany sampling operation eration σ wouldwhose whose deriv atives w e wan t to calculate. The result tells us how an infinitesimal again with the same value of z. change in µ or σ would change the output if we could repeat the sampling operation Being back-propagate through this sampling op operation eration allows us to again withable the to same value of z. incorp incorporate orate it into a larger graph. We can build elements of the graph on top of the Being ablesampling to back-propagate this sampling operation us to output of the distribution.through For example, we can compute theallows deriv derivativ ativ atives es incorp orate it into a larger graph. W e can build elements of the graph on top of the of some loss function J (y). We can also build elements of the graph whose outputs output the sampling distribution. 
Forsampling example,op we can compute the deriv es are the of inputs or the parameters of the operation. eration. For example, we ativ could of some loss function ) . W e can also build elements of the graph whose outputs J ( y build a larger graph with µ = f( x; θ) and σ = g(x ; θ). In this augmented graph, are the use inputs or the parameters of thethese sampling operation. Fore example, w e can bac back-propagation k-propagation through functions to deriv derive ∇ θJ (y). we could build a larger graph with µ = f( x; θ) and σ = g(x ; θ). In this augmented graph, Theuse principle used in this through Gaussianthese sampling example is more generally appliwe can back-propagation functions to deriv e J (y). cable. We can express an any y probability distribution of the form p (y; θ) or p( y | x; θ ) ∇ generally appliThe principle used in Gaussian sampling is more θ , and as p (y | ω ), where ω is a vthis ariable con containing taining both example parameters if applicable, p ()y,; where θ) or pω( yma cable. We can expressa an y probability of the pform xy; θin) x . Given y sampleddistribution (y | ω the inputs value from distribution may as p (ybe ω , where ωofisother a variable containing parameters θ , and if applicable, | turn a )function variables, we canboth rewrite x . Given a value y sampled from distribution p(y ω ), where ω may in the inputs | turn be a function of other variables, y ∼we p(ycan | ωrewrite ) (20.57) | as
y ∼ p(y | ω)    (20.57)

as

y = f(z; ω),    (20.58)

where z is a source of randomness. We may then compute the derivatives of y with respect to ω using traditional tools such as the back-propagation algorithm applied to f, so long as f is continuous and differentiable almost everywhere. Crucially, ω must not be a function of z, and z must not be a function of ω. This technique is often called the reparametrization trick, stochastic back-propagation or perturbation analysis.

The requirement that f be continuous and differentiable of course requires y to be continuous. If we wish to back-propagate through a sampling process that produces discrete-valued samples, it may still be possible to estimate a gradient on ω, using reinforcement learning algorithms such as variants of the REINFORCE algorithm (Williams, 1992), discussed in Sec. 20.9.1.
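As a concrete sketch of the trick (a made-up numpy example, not from the book): with the toy cost J(y) = y², the expected cost E[J(y)] = µ² + σ² has known gradients 2µ and 2σ, so the Monte Carlo gradients obtained by differentiating through y = µ + σz can be checked against them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up cost for illustration: J(y) = y^2, so E[J(y)] = mu^2 + sigma^2
# and the true gradients are 2*mu and 2*sigma.
mu, sigma = 1.5, 0.5

# Sample the fixed noise z ~ N(0, 1), then apply the deterministic map (Eq. 20.56).
z = rng.standard_normal(100_000)
y = mu + sigma * z

# Back-propagate through y = mu + sigma * z:
# dJ/dy = 2y, dy/dmu = 1, dy/dsigma = z.
grad_mu = np.mean(2 * y)
grad_sigma = np.mean(2 * y * z)

print(grad_mu, grad_sigma)  # close to 2*mu = 3.0 and 2*sigma = 1.0
```

Averaging the per-sample gradients over many draws of z recovers the gradient of the expectation; a single draw gives the noisy estimate typically used in training.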
CHAPTER 20. DEEP GENERATIVE MODELS
In neural network applications, we typically choose z to be drawn from some simple distribution, such as a unit uniform or unit Gaussian distribution, and achieve more complex distributions by allowing the deterministic portion of the network to reshape its input.

The idea of propagating gradients or optimizing through stochastic operations dates back to the mid-twentieth century (Price, 1958; Bonnet, 1964) and was first used for machine learning in the context of reinforcement learning (Williams, 1992). More recently, it has been applied to variational approximations (Opper and Archambeau, 2009) and stochastic or generative neural networks (Bengio et al., 2013b; Kingma, 2013; Kingma and Welling, 2014b,a; Rezende et al., 2014; Goodfellow et al., 2014c).
Many networks, such as denoising autoencoders or networks regularized with dropout, are also naturally designed to take noise as an input without requiring any special reparametrization to make the noise independent from the model.
20.9.1 Back-Propagating through Discrete Stochastic Operations
When a model emits a discrete variable y, the reparametrization trick is not applicable. Suppose that the model takes inputs x and parameters θ, both encapsulated in the vector ω, and combines them with random noise z to produce y:

y = f(z; ω).    (20.59)

Because y is discrete, f must be a step function. The derivatives of a step function are not useful at any point. Right at each step boundary, the derivatives are undefined, but that is a small problem. The large problem is that the derivatives are zero almost everywhere, on the regions between step boundaries. The derivatives of any cost function J(y) therefore do not give any information for how to update the model parameters θ.

The REINFORCE algorithm (REward Increment = Non-negative Factor ×
Offset Reinforcement × Characteristic Eligibility) provides a framework defining a family of simple but powerful solutions (Williams, 1992). The core idea is that even though J(f(z; ω)) is a step function with useless derivatives, the expected cost E_{z∼p(z)} J(f(z; ω)) is often a smooth function amenable to gradient descent. Although that expectation is typically not tractable when y is high-dimensional (or is the result of the composition of many discrete stochastic decisions), it can be estimated without bias using a Monte Carlo average. The stochastic estimate of the gradient can be used with SGD or other stochastic gradient-based optimization techniques.
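A minimal illustration of this Monte Carlo estimate, using a made-up Bernoulli example rather than anything from the book: for y ∼ Bernoulli(sigmoid(ω)), the score ∂ log p(y)/∂ω works out to y − sigmoid(ω), so the sampled estimator can be checked against the exact gradient of the expected cost.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

# Made-up setup: y ~ Bernoulli(sigmoid(w)) with arbitrary costs J(0), J(1).
w = 0.3
J = np.array([2.0, 0.5])   # J[0] = J(y=0), J[1] = J(y=1)
p1 = sigmoid(w)

# Monte Carlo REINFORCE estimate: average of J(y) * d log p(y)/dw,
# where d log p(y)/dw = y - sigmoid(w) for this parametrization.
m = 200_000
y = (rng.random(m) < p1).astype(int)
grad_est = np.mean(J[y] * (y - p1))

# Exact gradient of E[J(y)] = J(1)*p1 + J(0)*(1 - p1), for comparison.
grad_exact = (J[1] - J[0]) * p1 * (1 - p1)

print(grad_est, grad_exact)
```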
The simplest version of REINFORCE can be derived by simply differentiating the expected cost:

E_z[J(y)] = Σ_y J(y) p(y)    (20.60)

∂E[J(y)]/∂ω = Σ_y J(y) ∂p(y)/∂ω    (20.61)

= Σ_y J(y) p(y) ∂ log p(y)/∂ω    (20.62)

≈ (1/m) Σ_{i=1}^{m} J(y^(i)) ∂ log p(y^(i))/∂ω,  with y^(i) ∼ p(y).    (20.63)

Eq. 20.61 relies on the assumption that J does not reference ω directly. It is trivial to extend the approach to relax this assumption. Eq. 20.62 exploits the derivative rule for the logarithm, ∂ log p(y)/∂ω = (1/p(y)) ∂p(y)/∂ω. Eq. 20.63 gives an unbiased Monte Carlo estimator of the gradient.

Anywhere we write p(y) in this section, one could equally write p(y | x). This is because p(y) is parametrized by ω, and ω contains both θ and x, if x is present.

One issue with the above simple REINFORCE estimator is that it has a very
high variance, so that many samples of y need to be drawn to obtain a good estimator of the gradient, or equivalently, if only one sample is drawn, SGD will converge very slowly and will require a smaller learning rate. It is possible to considerably reduce the variance of that estimator by using variance reduction methods (Wilson, 1984; L'Ecuyer, 1994). The idea is to modify the estimator so that its expected value remains unchanged but its variance gets reduced. In the context of REINFORCE, the proposed variance reduction methods involve the computation of a baseline that is used to offset J(y). Note that any offset b(ω) that does not depend on y would not change the expectation of the estimated gradient because

E_{p(y)}[∂ log p(y)/∂ω] = Σ_y p(y) ∂ log p(y)/∂ω    (20.64)

= Σ_y ∂p(y)/∂ω    (20.65)

= ∂/∂ω Σ_y p(y) = ∂/∂ω 1 = 0,    (20.66)
which means that

E_{p(y)}[(J(y) − b(ω)) ∂ log p(y)/∂ω] = E_{p(y)}[J(y) ∂ log p(y)/∂ω] − b(ω) E_{p(y)}[∂ log p(y)/∂ω]    (20.67)

= E_{p(y)}[J(y) ∂ log p(y)/∂ω].    (20.68)

Furthermore, we can obtain the optimal b(ω) by computing the variance of (J(y) − b(ω)) ∂ log p(y)/∂ω under p(y) and minimizing with respect to b(ω). What we find is that this optimal baseline b*(ω)_i is different for each element ω_i of the vector ω:

b*(ω)_i = E_{p(y)}[J(y) (∂ log p(y)/∂ω_i)²] / E_{p(y)}[(∂ log p(y)/∂ω_i)²].    (20.69)

The gradient estimator with respect to ω_i then becomes

(J(y) − b(ω)_i) ∂ log p(y)/∂ω_i,    (20.70)

where b(ω)_i estimates the above b*(ω)_i. The estimate b is usually obtained by adding extra outputs to the neural network and training the new outputs to estimate E_{p(y)}[J(y) (∂ log p(y)/∂ω_i)²] and E_{p(y)}[(∂ log p(y)/∂ω_i)²] for each element of ω. These extra outputs can be trained with the mean squared error objective, using respectively J(y) (∂ log p(y)/∂ω_i)² and (∂ log p(y)/∂ω_i)² as targets when y is sampled from p(y), for a given ω.
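For a distribution with only two outcomes, Eq. 20.69 can be evaluated in closed form. The sketch below (the same made-up Bernoulli costs as in the earlier example, with scalar ω) computes b* and checks that it yields lower variance than either no baseline or the simple choice b = E[J(y)]; for a two-outcome variable the optimal baseline in fact drives the variance all the way to zero.

```python
import numpy as np

sigmoid = lambda w: 1.0 / (1.0 + np.exp(-w))

# Made-up Bernoulli model: y ~ Bernoulli(sigmoid(w)), costs J(0) = 2.0, J(1) = 0.5.
w, J = 0.3, np.array([2.0, 0.5])
p1 = sigmoid(w)
p = np.array([1 - p1, p1])   # p(y = 0), p(y = 1)
y = np.array([0, 1])
s = y - p1                   # score d log p(y)/dw for each outcome

# Eq. 20.69 (scalar omega): b* = E[J(y) s^2] / E[s^2].
b_star = np.sum(p * J * s**2) / np.sum(p * s**2)

# Exact variance of the estimator (J(y) - b) * s under p(y), for any baseline b.
def grad_var(b):
    vals = (J - b) * s
    mean = np.sum(p * vals)
    return np.sum(p * (vals - mean) ** 2)

print(grad_var(0.0), grad_var(np.sum(p * J)), grad_var(b_star))
```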
The estimate b may then be recovered by substituting these estimates into Eq. 20.69. Mnih and Gregor (2014) preferred to use a single shared output (across all elements i of ω) trained with the target J(y), using as baseline b(ω) ≈ E_{p(y)}[J(y)].

Variance reduction methods have been introduced in the reinforcement learning context (Sutton et al., 2000; Weaver and Tao, 2001), generalizing previous work on the case of binary reward by Dayan (1990). See Bengio et al. (2013b), Mnih and Gregor (2014), Ba et al. (2014), Mnih et al. (2014), or Xu et al. (2015) for examples of modern uses of the REINFORCE algorithm with reduced variance in the context of deep learning. In addition to the use of an input-dependent baseline b(ω), Mnih and Gregor (2014) found that the scale of (J(y) − b(ω)) could be
In addition to the use of an input-dep endent baseline adjusted during training by dividing it by its standard deviation estimated by a ), Mnih and during Gregortraining, (2014) found that of theadaptiv scale eoflearning (J (y ) rate, be b( ωving b( ω))tocould mo moving average as a kind adaptive counter adjusted training by dividing itsduring standard ya the effectduring of important variations thatitoby ccur thedeviation course−of estimated training inbthe moving average training, kind of (adaptiv e learning rate, to varianc countere magnitude of thisduring quantit quantity y. Mnih as anda Gregor 2014) called this heuristic variance the effect of important v ariations that o ccur during the course of training in the normalization normalization.. magnitude of this quantity. Mnih and Gregor (2014) called this heuristic variance 693 normalization.
REINFORCE-based estimators can be understood as estimating the gradient by correlating choices of y with corresponding values of J(y). If a good value of y is unlikely under the current parametrization, it might take a long time to obtain it by chance, and get the required signal that this configuration should be reinforced.
20.10 Directed Generative Nets
As discussed in Chapter 16, directed graphical models make up a prominent class of graphical models. While directed graphical models have been very popular within the greater machine learning community, within the smaller deep learning community they have until roughly 2013 been overshadowed by undirected models such as the RBM.

In this section we review some of the standard directed graphical models that have traditionally been associated with the deep learning community.

We have already described deep belief networks, which are a partially directed model. We have also already described sparse coding models, which can be thought of as shallow directed generative models. They are often used as feature learners in the context of deep learning, though they tend to perform poorly at sample generation and density estimation. We now describe a variety of deep, fully directed models.
20.10.1 Sigmoid Belief Nets
Sigmoid belief networks (Neal, 1990) are a simple form of directed graphical model with a specific kind of conditional probability distribution. In general, we can think of a sigmoid belief network as having a vector of binary states s, with each element of the state influenced by its ancestors:

p(s_i) = σ( Σ_{j<i} W_{j,i} s_j + b_i ).    (20.71)
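Ancestral sampling from Eq. 20.71 is straightforward to sketch. The weights below are made up for illustration; the lower-triangular matrix ensures each unit s_i depends only on the units sampled before it.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Hypothetical tiny sigmoid belief net over 5 binary units.
# Row i of W holds the weights from ancestors j < i into unit i.
n = 5
W = np.tril(rng.normal(size=(n, n)), k=-1)
b = rng.normal(size=n)

def ancestral_sample():
    s = np.zeros(n)
    for i in range(n):                  # sample units in topological order
        p_i = sigmoid(W[i] @ s + b[i])  # only s[0:i] contribute (Eq. 20.71)
        s[i] = rng.random() < p_i
    return s

sample = ancestral_sample()
print(sample)  # a vector of binary states
```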
The most common structure of sigmoid belief network is one that is divided into many layers, with ancestral sampling proceeding through a series of many hidden layers and then ultimately generating the visible layer. This structure is very similar to the deep belief network, except that the units at the beginning of the sampling process are independent from each other, rather than sampled from a restricted Boltzmann machine. Such a structure is interesting for a variety of
reasons. One reason is that the structure is a universal approximator of probability distributions over the visible units, in the sense that it can approximate any probability distribution over binary variables arbitrarily well, given enough depth, even if the width of the individual layers is restricted to the dimensionality of the visible layer (Sutskever and Hinton, 2008).

While generating a sample of the visible units is very efficient in a sigmoid belief network, most other operations are not. Inference over the hidden units given the visible units is intractable. Mean field inference is also intractable because the variational lower bound involves taking expectations of cliques that encompass entire layers. This problem has remained difficult enough to restrict the popularity of directed discrete networks.

One approach for performing inference in a sigmoid belief network is to construct
a different lower bound that is specialized for sigmoid belief networks (Saul et al., 1996). This approach has only been applied to very small networks. Another approach is to use learned inference mechanisms as described in Sec. 19.5. The Helmholtz machine (Dayan et al., 1995; Dayan and Hinton, 1996) is a sigmoid belief network combined with an inference network that predicts the parameters of the mean field distribution over the hidden units. Modern approaches (Gregor et al., 2014; Mnih and Gregor, 2014) to sigmoid belief networks still use this inference network approach. These techniques remain difficult due to the discrete nature of the latent variables. One cannot simply back-propagate through the output of the inference network, but instead must use the relatively unreliable machinery for back-propagating through discrete sampling processes, described in Sec. 20.9.1.
Recent approaches based on importance reweighted wake-sleep (Bornschein and Bengio, 2015) and bidirectional Helmholtz machines (Bornschein et al., 2015) make it possible to quickly train sigmoid belief networks and reach state-of-the-art performance on benchmark tasks.

A special case of sigmoid belief networks is the case where there are no latent variables. Learning in this case is efficient, because there is no need to marginalize latent variables out of the likelihood. A family of models called auto-regressive networks generalize this fully visible belief network to other kinds of variables besides binary variables and other structures of conditional distributions besides log-linear relationships. Auto-regressive networks are described later, in Sec. 20.10.7.
20.10.2 Differentiable Generator Nets
Many generative models are based on the idea of using a differentiable generator network. The model transforms samples of latent variables z to samples x or
to distributions over samples x using a differentiable function g(z; θ^(g)) which is typically represented by a neural network. This model class includes variational autoencoders, which pair the generator net with an inference net, generative adversarial networks, which pair the generator network with a discriminator network, and techniques that train generator networks in isolation.

Generator networks are essentially just parametrized computational procedures for generating samples, where the architecture provides the family of possible distributions to sample from and the parameters select a distribution from within that family.

As an example, the standard procedure for drawing samples from a normal distribution with mean µ and covariance Σ is to feed samples z from a normal distribution with zero mean and identity covariance into a very simple generator network.
This generator network contains just one affine layer:

x = g(z) = µ + Lz,    (20.72)

where L is given by the Cholesky decomposition of Σ.
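A numpy sketch of this one-affine-layer generator, with a made-up mean and covariance: pushing unit-Gaussian noise through x = µ + Lz produces samples whose empirical moments match the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up target Gaussian: mean mu, covariance Sigma.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)   # Sigma = L @ L.T

# Eq. 20.72: one affine layer applied to z ~ N(0, I).
z = rng.standard_normal((100_000, 2))
x = mu + z @ L.T

print(x.mean(axis=0))           # close to mu
print(np.cov(x, rowvar=False))  # close to Sigma
```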
Pseudorandom number generators can also use nonlinear transformations of simple distributions. For example, inverse transform sampling (Devroye, 2013) draws a scalar z from U(0, 1) and applies a nonlinear transformation to a scalar x. In this case g(z) is given by the inverse of the cumulative distribution function F(x) = ∫_{−∞}^{x} p(v) dv. If we are able to specify p(x), integrate over x, and invert the resulting function, we can sample from p(x) without using machine learning.

To generate samples from more complicated distributions that are difficult to specify directly, difficult to integrate over, or whose resulting integrals are difficult to invert, we use a feedforward network to represent a parametric family of nonlinear functions g, and use training data to infer the parameters selecting the desired function.

We can think of g as providing a nonlinear change of variables that transforms
the distribution over z into the desired distribution over x.

Recall from Eq. 3.47 that, for invertible, differentiable, continuous g,

p_z(z) = p_x(g(z)) |det(∂g/∂z)|.    (20.73)

This implicitly imposes a probability distribution over x:

p_x(x) = p_z(g^{−1}(x)) / |det(∂g/∂z)|.    (20.74)
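These formulas can be verified on a simple made-up example: inverse transform sampling for an Exponential(λ) distribution, where g(z) = −ln(1 − z)/λ maps z ∼ U(0, 1) to x, and Eq. 20.74 with p_z = 1 on (0, 1) should recover the density λe^{−λx}.

```python
import numpy as np

lam = 2.0

# Inverse transform sampling for Exponential(lam):
# F(x) = 1 - exp(-lam * x), so g(z) = F^{-1}(z) = -ln(1 - z)/lam.
g = lambda z: -np.log1p(-z) / lam
g_inv = lambda x: -np.expm1(-lam * x)      # 1 - exp(-lam * x)
dg_dz = lambda z: 1.0 / (lam * (1.0 - z))  # derivative of g

# Eq. 20.74 with p_z(z) = 1 on (0, 1): p_x(x) = 1 / |dg/dz| at z = g^{-1}(x).
x = np.linspace(0.1, 2.0, 5)
p_x = 1.0 / np.abs(dg_dz(g_inv(x)))

print(np.allclose(p_x, lam * np.exp(-lam * x)))  # recovers the target density
```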
Of course, this formula may be difficult to evaluate, depending on the choice of g, so we often use indirect means of learning g, rather than trying to maximize log p(x) directly.

In some cases, rather than using g to provide a sample of x directly, we use g to define a conditional distribution over x. For example, we could use a generator net whose final layer consists of sigmoid outputs to provide the mean parameters of Bernoulli distributions:

p(x_i = 1 | z) = g(z)_i.    (20.75)

In this case, when we use g to define p(x | z), we impose a distribution over x by marginalizing z:

p(x) = E_z p(x | z).    (20.76)

Both approaches define a distribution p_g(x) and allow us to train various criteria of p_g using the reparametrization trick of Sec. 20.9.

The two different approaches to formulating generator nets—emitting the parameters of a conditional distribution versus directly emitting samples—have
When the generator net defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data. When the generator net provides samples directly, it is capable of generating only continuous data (we could introduce discretization in the forward propagation, but this would lose the ability to learn the model using back-propagation). The advantage to direct sampling is that we are no longer forced to use conditional distributions whose form can be easily written down and algebraically manipulated by a human designer.

Approaches based on differentiable generator networks are motivated by the success of gradient descent applied to differentiable feedforward networks for classification. In the context of supervised learning, deep feedforward networks trained with gradient-based learning seem practically guaranteed to succeed given enough hidden units and enough training data. Can this same recipe for success transfer to generative modeling?

Generative modeling seems to be more difficult than classification or regression because the learning process requires optimizing intractable criteria. In the context of differentiable generator nets, the criteria are intractable because the data does not specify both the inputs z and the outputs x of the generator net. In the case of supervised learning, both the inputs x and the outputs y were given, and the optimization procedure needs only to learn how to produce the specified mapping. In the case of generative modeling, the learning procedure needs to determine how to arrange z space in a useful way and additionally how to map from z to x.
Dosovitskiy et al. (2015) studied a simplified problem, where the correspondence between z and x is given. Specifically, the training data is computer-rendered imagery of chairs. The latent variables z are parameters given to the rendering engine describing the choice of which chair model to use, the position of the chair, and other configuration details that affect the rendering of the image. Using this synthetically generated data, a convolutional network is able to learn to map z descriptions of the content of an image to x approximations of rendered images. This suggests that contemporary differentiable generator networks have sufficient model capacity to be good generative models, and that contemporary optimization algorithms have the ability to fit them. The difficulty lies in determining how to train generator networks when the value of z for each x is not fixed and known ahead of time.

The following sections describe several approaches to training differentiable generator nets given only training samples of x.
20.10.3 Variational Autoencoders
The variational autoencoder or VAE (Kingma, 2013; Rezende et al., 2014) is a directed model that uses learned approximate inference and can be trained purely with gradient-based methods.

To generate a sample from the model, the VAE first draws a sample z from the code distribution p_model(z). The sample is then run through a differentiable generator network g(z). Finally, x is sampled from a distribution p_model(x; g(z)) = p_model(x | z). However, during training, the approximate inference network (or encoder) q(z | x) is used to obtain z, and p_model(x | z) is then viewed as a decoder network.

The key insight behind variational autoencoders is that they may be trained by maximizing the variational lower bound L(q) associated with data point x:

L(q) = E_{z∼q(z|x)} log p_model(z, x) + H(q(z | x))   (20.77)
     = E_{z∼q(z|x)} log p_model(x | z) − D_KL(q(z | x) ‖ p_model(z))   (20.78)
     ≤ log p_model(x).   (20.79)

In Eq. 20.77, we recognize the first term as the joint log-likelihood of the visible and hidden variables under the approximate posterior over the latent variables (just like with EM, except that we use an approximate rather than the exact posterior). We recognize also a second term, the entropy of the approximate posterior. When q is chosen to be a Gaussian distribution, with noise added to a predicted mean value, maximizing this entropy term encourages increasing the standard deviation
of this noise. More generally, this entropy term encourages the variational posterior to place high probability mass on many z values that could have generated x, rather than collapsing to a single point estimate of the most likely value. In Eq. 20.78, we recognize the first term as the reconstruction log-likelihood found in other autoencoders. The second term tries to make the approximate posterior distribution q(z | x) and the model prior p_model(z) approach each other.

Traditional approaches to variational inference and learning infer q via an optimization algorithm, typically iterated fixed point equations (Sec. 19.4). These approaches are slow and often require the ability to compute E_{z∼q} log p_model(z, x) in closed form. The main idea behind the variational autoencoder is to train a parametric encoder (also sometimes called an inference network or recognition model) that produces the parameters of q. So long as z is a continuous variable, we can then back-propagate through samples of z drawn from q(z | x) = q(z; f(x; θ)) in order to obtain a gradient with respect to θ. Learning then consists solely of maximizing L with respect to the parameters of the encoder and decoder. All of the expectations in L may be approximated by Monte Carlo sampling.

The variational autoencoder approach is elegant, theoretically pleasing, and simple to implement. It also obtains excellent results and is among the state of the art approaches to generative modeling. Its main drawback is that samples from variational autoencoders trained on images tend to be somewhat blurry. The causes of this phenomenon are not yet known. One possibility is that the blurriness is an intrinsic effect of maximum likelihood, which minimizes D_KL(p_data ‖ p_model). As illustrated in Fig. 3.6, this means that the model will assign high probability to points that occur in the training set, but may also assign high probability to other points. These other points may include blurry images. Part of the reason that the model would choose to put probability mass on blurry images rather than some other part of the space is that the variational autoencoders used in practice usually have a Gaussian distribution for p_model(x; g(z)). Maximizing a lower bound on the likelihood of such a distribution is similar to training a traditional autoencoder with mean squared error, in the sense that it has a tendency to ignore features of the input that occupy few pixels or that cause only a small change in the brightness of the pixels that they occupy. This issue is not specific to VAEs and is shared with generative models that optimize a log-likelihood, or equivalently, D_KL(p_data ‖ p_model), as argued by Theis et al. (2015) and by Huszar (2015). Another troubling issue with contemporary VAE models is that they tend to use only a small subset of the dimensions of z, as if the encoder was not able to transform enough of the local directions in input space to a space where the marginal distribution matches the factorized prior.
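The training criterion described above can be sketched numerically. The fragment below evaluates the bound of Eq. 20.78 for one data point, using an analytic KL term for a Gaussian q and the reparametrization z = μ + σε; the tiny linear encoder and decoder (and all their weights) are hypothetical stand-ins for learned networks, purely for illustration.

```python
import numpy as np

# Single-data-point sketch of the bound in Eq. 20.78 with the
# reparametrization trick. The linear "encoder" and "decoder" here are
# illustrative placeholders, not trained networks.
rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])                    # one 2-D observation
We, Wd = np.eye(2) * 0.8, np.eye(2) * 1.1    # toy encoder/decoder weights

def elbo(x, n_samples=1000):
    # encoder: q(z | x) = N(mu, diag(sigma^2)) with parameters f(x; theta)
    mu, log_sigma = We @ x, np.full(2, -1.0)
    sigma = np.exp(log_sigma)
    # analytic KL(q(z|x) || p(z)) against a standard normal prior
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    # Monte Carlo estimate of E_q log p(x | z), with z = mu + sigma * eps
    eps = rng.standard_normal((n_samples, 2))
    z = mu + sigma * eps
    recon = Wd @ z.T                         # decoder mean g(z)
    # log density of a unit-variance Gaussian p(x | z) = N(x; g(z), I)
    log_px_z = -0.5 * np.sum((x[:, None] - recon)**2, axis=0) \
               - np.log(2.0 * np.pi)
    return np.mean(log_px_z) - kl

bound = elbo(x)
assert np.isfinite(bound)
```

Because z is expressed as a deterministic function of μ, σ, and the noise ε, the same Monte Carlo estimate would be differentiable with respect to the encoder and decoder parameters in an autodiff framework.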
The VAE framework is very straightforward to extend to a wide range of model architectures. This is a key advantage over Boltzmann machines, which require extremely careful model design to maintain tractability. VAEs work very well with a diverse family of differentiable operators. One particularly sophisticated VAE is the deep recurrent attention writer or DRAW model (Gregor et al., 2015). DRAW uses a recurrent encoder and recurrent decoder combined with an attention mechanism. The generation process for the DRAW model consists of sequentially visiting different small image patches and drawing the values of the pixels at those points. VAEs can also be extended to generate sequences by defining variational RNNs (Chung et al., 2015b) by using a recurrent encoder and decoder within the VAE framework. Generating a sample from a traditional RNN involves only non-deterministic operations at the output space. Variational RNNs also have
Variational RNNs alsowithin hav havee the V AE framew ork. Generating a sample from a traditional RNN in v olv es random variabilit ariability y at the poten otentially tially more abstract level captured by the Vonly AE non-deterministic laten latentt variables. operations at the output space. Variational RNNs also have random variability at the potentially more abstract level captured by the VAE The VAE framew framework ork has been extended to maximize not just the traditional laten t variables. variational low lower er bound, but instead the imp importanc ortanc ortancee weighte weighted d auto autoenc enc enco oder (Burda The V AE framew ork has been extended to maximize not just the traditional et al. al.,, 2015) ob objectiv jectiv jective: e: variational lower bound, but instead the importance weighted autoencoder (Burda " # k et al., 2015) ob jective: 1 X pmodel(x, z(i) ) Lk(x, q) = Ez (1) ,...,z(k) ∼q(z|x) log . (20.80) (i) | x) k q ( z 1 i=1 p (x, z ) E (x, q) = log . (20.80) k ) q(zlow L when k = 1. This new Lob objectiv jectiv jectivee is equiv equivalent alent to the traditional lower erxbound | of the true log pmodel (x) Ho How wev ever, er, it may also be in interpreted terpreted as forming an estimate k = 1. This new ob jectiv e is equiv alent to the traditional low er bound when using imp importance ortance sampling of z from prop proposal importance ortance " osal distribution q( z | x#). The imp log pbecomes (x ) Ho wev er,auto it may also in terpreted as aforming an estimate the true p model (xL) and w eigh eighted ted autoenco enco encoder derbe ob objectiv jectiv jective e is also low lower erX bound on log of z q ( using imp ortance sampling of from prop osal distribution ) . The imp ortance z x tigh tighter ter as k increases. (x) and becomes weighted autoencoder ob jective is also a lower bound on log p | Variational auto autoenco enco encoders ders hav havee some interesting connections to the MP-DBM tighter as k increases. 
Variational autoencoders have some interesting connections to the MP-DBM and other approaches that involve back-propagation through the approximate inference graph (Goodfellow et al., 2013b; Stoyanov et al., 2011; Brakel et al., 2013). These previous approaches required an inference procedure such as mean field fixed point equations to provide the computational graph. The variational autoencoder is defined for arbitrary computational graphs, which makes it applicable to a wider range of probabilistic model families because there is no need to restrict the choice of models to those with tractable mean field fixed point equations. The variational autoencoder also has the advantage that it increases a bound on the log-likelihood of the model, while the criteria for the MP-DBM and related models are more heuristic and have little probabilistic interpretation beyond making the results of approximate inference accurate. One disadvantage of the variational autoencoder
One disadv disadvantage antage of the variational auto autoenco enco encoder der heuristic havan e little probabilistic interpretation b eyond making the results of. z is that it and learns inference net for only one problem, inferring giv network work given en x approximate inference accurate. One disadvantage of the variational autoencoder 700 only one problem, inferring z given x. is that it learns an inference network for
The older methods are able to perform approximate inference over any subset of variables given any other subset of variables, because the mean field fixed point equations specify how to share parameters between the computational graphs for all of these different problems.

One very nice property of the variational autoencoder is that simultaneously training a parametric encoder in combination with the generator network forces the model to learn a predictable coordinate system that the encoder can capture. This makes it an excellent manifold learning algorithm. See Fig. 20.6 for examples of low-dimensional manifolds learned by the variational autoencoder. In one of the cases demonstrated in the figure, the algorithm discovered two independent factors of variation present in images of faces: angle of rotation and emotional expression.
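The kind of visualization shown in Fig. 20.6 can be sketched as follows: decode a uniform 2-D grid of codes z through the model's decoder and tile the results into one large image. The `decode` function below is a hypothetical placeholder for a trained p(x | z) mean function producing 28x28 images.

```python
import numpy as np

# Sketch of the Fig. 20.6-style visualization: decode a uniform 2-D
# grid of codes into image tiles. `decode` is a placeholder for a
# trained decoder; here it just maps a code to a synthetic 28x28 array.
def decode(z):
    # hypothetical stand-in for the mean of p(x | z)
    return np.outer(np.full(28, z[0]), np.full(28, z[1]))

grid = np.linspace(-2.0, 2.0, 10)            # 10x10 grid of codes
tiles = np.array([[decode(np.array([zi, zj])) for zj in grid]
                  for zi in grid])           # shape (10, 10, 28, 28)
# Arrange the tiles into one sheet suitable for plotting.
sheet = tiles.transpose(0, 2, 1, 3).reshape(10 * 28, 10 * 28)
assert sheet.shape == (280, 280)
```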
Figure 20.6: Examples of two-dimensional coordinate systems for high-dimensional manifolds, learned by a variational autoencoder (Kingma and Welling, 2014a). Two dimensions may be plotted directly on the page for visualization, so we can gain an understanding of how the model works by training a model with a 2-D latent code, even if we believe the intrinsic dimensionality of the data manifold is much higher. The images shown are not examples from the training set but images x actually generated by the model p(x | z), simply by changing the 2-D “code” z (each image corresponds to a different choice of “code” z on a 2-D uniform grid). (Left) The two-dimensional map of the Frey faces manifold. One dimension that has been discovered (horizontal) mostly corresponds to a rotation of the face, while the other (vertical) corresponds to the emotional expression.
(Right) The two-dimensional map of the MNIST manifold.
20.10.4 Generative Adversarial Networks
Generative adversarial networks or GANs (Goodfellow et al., 2014c) are another generative modeling approach based on differentiable generator networks.

Generative adversarial networks are based on a game theoretic scenario in which the generator network must compete against an adversary. The generator network directly produces samples x = g(z; θ^(g)). Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator. The discriminator emits a probability value given by d(x; θ^(d)), indicating the probability that x is a real training example rather than a fake sample drawn from the model.

The simplest way to formulate learning in generative adversarial networks is as a zero-sum game, in which a function v(θ^(g), θ^(d)) determines the payoff of the discriminator. The generator receives −v(θ^(g), θ^(d)) as its own payoff. During learning, each player attempts to maximize its own payoff, so that at convergence

g* = arg min_g max_d v(g, d).   (20.81)

The default choice for v is

v(θ^(g), θ^(d)) = E_{x∼p_data} log d(x) + E_{x∼p_model} log(1 − d(x)).   (20.82)

This drives the discriminator to attempt to learn to correctly classify samples as real or fake. Simultaneously, the generator attempts to fool the classifier into believing its samples are real. At convergence, the generator's samples are indistinguishable from real data, and the discriminator outputs 1/2 everywhere. The discriminator may then be discarded.

The main motivation for the design of GANs is that the learning process requires neither approximate inference nor approximation of a partition function gradient. In the case where max_d v(g, d) is convex in θ^(g) (such as the case where optimization is performed directly in the space of probability density functions), the procedure is guaranteed to converge and is asymptotically consistent.
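A 1-D numeric sketch may make the payoff in Eq. 20.82 concrete. Everything below (real data from N(2, 1), a one-parameter generator g(z) = z + θ_g, a two-parameter logistic discriminator) is an illustrative toy setup, not from the text.

```python
import numpy as np

# Toy evaluation of the GAN payoff v (Eq. 20.82). Real data comes from
# N(2, 1); the generator produces g(z) = z + theta_g with z ~ N(0, 1);
# the discriminator is d(x) = sigmoid(a * (x - b)). All hypothetical.
rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def v(theta_g, a, b, n=20000):
    # v = E_{x~p_data} log d(x) + E_{x~p_model} log(1 - d(x))
    x_real = rng.normal(2.0, 1.0, n)
    x_fake = rng.normal(0.0, 1.0, n) + theta_g
    return (np.mean(np.log(sigmoid(a * (x_real - b)))) +
            np.mean(np.log(1.0 - sigmoid(a * (x_fake - b)))))

# A blind discriminator (a = 0) outputs 1/2 everywhere: v = 2 log(1/2).
assert np.isclose(v(-1.0, 0.0, 0.0), 2.0 * np.log(0.5))
# Against a badly mismatched generator, a separating discriminator
# achieves a strictly higher payoff than the constant-1/2 one.
assert v(-1.0, 2.0, 0.5) > 2.0 * np.log(0.5)
```

In practice the discriminator ascends v while the generator descends it, via simultaneous stochastic gradient steps on θ^(d) and θ^(g).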
Go Goo odfellow Unfortunately , learning in GANs can b e difficult in practice when and d (2014) identified non-con non-conv vergence as an issue that may cause GANs to gunderfit. max v ( g , d aregeneral, represensim tedultaneous by neuralgradien networks andt on two pla )yers’ is not conisvex. odfellow In simultaneous gradient t descen descent play costs not Go guaran guaranteed teed (to 2014 ) identified non-con v ergence as an issue that may cause GANs to underfit. ab,, reach an equilibrium. Consider for example the value function v(a, b ) = ab In general, simultaneous descencost t onabtw players’ costs is not teed a and tincurs b where one play player er controlsgradien , owhile the other pla play yerguaran con controls trols v ( a, b ) = ab to reach an equilibrium. Consider for example the v alue function and receiv receives es a cost −ab . If we mo model del each play player er as making infinitesimally small, where one player controls a and incurs cost ab, while the other player controls b 702 player as making infinitesimally small and receives a cost ab . If we model each −
gradient steps, each player reducing their own cost at the expense of the other player, then a and b go into a stable, circular orbit, rather than arriving at the equilibrium point at the origin. Note that the equilibria for a minimax game are not local minima of v. Instead, they are points that are simultaneously minima for both players' costs. This means that they are saddle points of v that are local minima with respect to the first player's parameters and local maxima with respect to the second player's parameters. It is possible for the two players to take turns increasing then decreasing v forever, rather than landing exactly on the saddle point where neither player is capable of reducing their cost. It is not known to what extent this non-convergence problem affects GANs.
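The failure mode in the v(a, b) = ab example can be simulated directly. This is a self-contained numerical sketch (step size and starting point are arbitrary choices):

```python
# Simultaneous gradient descent on v(a, b) = a*b. Player one minimizes
# ab by descending the gradient with respect to a (which is b); player
# two minimizes -ab by descending the gradient with respect to b (-a).
eta = 0.01                    # step size
a, b = 1.0, 0.0               # start away from the equilibrium (0, 0)
r0 = (a**2 + b**2) ** 0.5
for _ in range(1000):
    da, db = b, -a                        # each player's own gradient
    a, b = a - eta * da, b - eta * db     # simultaneous update
r1 = (a**2 + b**2) ** 0.5
# The pair orbits the origin rather than converging to it; with a finite
# step size each update multiplies the squared radius by (1 + eta**2),
# so the distance to the equilibrium actually grows slightly.
print(r0, r1)
```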
Goodfellow (2014) identified an alternative formulation of the payoffs, in which the game is no longer zero-sum, that has the same expected gradient as maximum likelihood learning whenever the discriminator is optimal. Because maximum likelihood training converges, this reformulation of the GAN game should also converge, given enough samples. Unfortunately, this alternative formulation does not seem to perform well in practice, possibly due to suboptimality of the discriminator, or possibly due to high variance around the expected gradient.

In practice, the best-performing formulation of the GAN game is a different formulation that is neither zero-sum nor equivalent to maximum likelihood, introduced by Goodfellow et al. (2014c) with a heuristic motivation.
In this best-performing formulation, the generator aims to increase the log probability that the discriminator makes a mistake, rather than aiming to decrease the log probability that the discriminator makes the correct prediction. This reformulation is motivated solely by the observation that it causes the derivative of the generator's cost function with respect to the discriminator's logits to remain large even in the situation where the discriminator confidently rejects all generator samples.

Stabilization of GAN learning remains an open problem. Fortunately, GAN learning performs well when the model architecture and hyperparameters are carefully selected. Radford et al. (2015) crafted a deep convolutional GAN (DCGAN) that performs very well for image synthesis tasks, and showed that its latent representation space captures important factors of variation, as shown in Fig. 15.9. See Fig. 20.7 for examples of images generated by a DCGAN generator.
The GAN learning problem can also be simplified by breaking the generation process into many levels of detail. It is possible to train conditional GANs (Mirza and Osindero, 2014) that learn to sample from a distribution p(x | y) rather than simply sampling from a marginal distribution p(x). Denton et al. (2015) showed that a series of conditional GANs can be trained to first generate a very low-resolution version of an image, then incrementally add details to the image.
Figure 20.7: Images generated by GANs trained on the LSUN dataset. (Left) Images of bedrooms generated by a DCGAN model, reproduced with permission from Radford et al. (2015). (Right) Images of churches generated by a LAPGAN model, reproduced with permission from Denton et al. (2015).
This technique is called the LAPGAN model, due to the use of a Laplacian pyramid to generate the images containing varying levels of detail. LAPGAN generators are able to fool not only discriminator networks but also human observers, with experimental subjects identifying up to 40% of the outputs of the network as being real data. See Fig. 20.7 for examples of images generated by a LAPGAN generator.

One unusual capability of the GAN training procedure is that it can fit probability distributions that assign zero probability to the training points. Rather than maximizing the log probability of specific points, the generator net learns to trace out a manifold whose points resemble training points in some way. Somewhat paradoxically, this means that the model may assign a log-likelihood of negative infinity to the test set, while still representing a manifold that a human observer judges to capture the essence of the generation task. This is not clearly an advantage or a disadvantage, and one may also guarantee that the generator network assigns non-zero probability to all points simply by making the last layer of the generator network add Gaussian noise to all of the generated values. Generator networks that add Gaussian noise in this manner sample from the same distribution that one obtains by using the generator network to parametrize the mean of a conditional Gaussian distribution.

Dropout seems to be important in the discriminator network. In particular, units should be stochastically dropped while computing the gradient for the generator network to follow. Following the gradient of the deterministic version of the discriminator with its weights divided by two does not seem to be as effective.
Likewise, never using dropout seems to yield poor results.

While the GAN framework is designed for differentiable generator networks, similar principles can be used to train other kinds of models. For example, self-supervised boosting can be used to train an RBM generator to fool a logistic regression discriminator (Welling et al., 2002).
20.10.5 Generative Moment Matching Networks
Generative moment matching networks (Li et al., 2015; Dziugaite et al., 2015) are another form of generative model based on differentiable generator networks. Unlike VAEs and GANs, they do not need to pair the generator network with any other network: neither an inference network as used with VAEs, nor a discriminator network as used with GANs.

These networks are trained with a technique called moment matching. The basic idea behind moment matching is to train the generator in such a way that many of the statistics of samples generated by the model are as similar as possible to those of the statistics of the examples in the training set. In this context, a moment is an expectation of different powers of a random variable. For example, the first moment is the mean, the second moment is the mean of the squared values, and so on.
In multiple dimensions, each element of the random vector may be raised to different powers, so that a moment may be any quantity of the form

E_x [ Π_i x_i^{n_i} ],   (20.83)

where n = [n_1, n_2, . . . , n_d]^⊤ is a vector of non-negative integers.
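Eq. 20.83 is straightforward to estimate empirically from samples. The sketch below uses a hypothetical toy dataset of independent unit Gaussians, chosen only so the expected values are easy to check:

```python
import numpy as np

rng = np.random.default_rng(0)

def moment(x, n):
    """Empirical estimate of E[prod_i x_i ** n_i] (Eq. 20.83).
    x: samples of shape (num_samples, d); n: d non-negative integers."""
    return np.prod(x ** np.asarray(n), axis=1).mean()

x = rng.normal(size=(100_000, 3))  # toy data: independent N(0, 1) variables

print(moment(x, [1, 0, 0]))  # first moment of x_1: close to 0
print(moment(x, [2, 0, 0]))  # second moment of x_1: close to 1
print(moment(x, [1, 1, 0]))  # cross moment E[x_1 x_2]: close to 0
```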
Upon first examination, this approach seems to be computationally infeasible. For example, if we want to match all the moments of the form x_i x_j, then we need to minimize the difference between a number of values that is quadratic in the dimension of x. Moreover, even matching all of the first and second moments would only be sufficient to fit a multivariate Gaussian distribution, which captures only linear relationships between values. Our ambitions for neural networks are to capture complex nonlinear relationships, which would require far more moments. GANs avoid this problem of exhaustively enumerating all moments by using a dynamically updated discriminator that automatically focuses its attention on whichever statistic the generator network is matching the least effectively.
Instead, generative moment matching networks can be trained by minimizing a cost function called maximum mean discrepancy (Schölkopf and Smola, 2002; Gretton et al., 2012), or MMD. This cost function measures the error in the first moments in an infinite-dimensional space, using an implicit mapping to feature
space defined by a kernel function in order to make computations on infinite-dimensional vectors tractable. The MMD cost is zero if and only if the two distributions being compared are equal.

Visually, the samples from generative moment matching networks are somewhat disappointing. Fortunately, they can be improved by combining the generator network with an autoencoder. First, an autoencoder is trained to reconstruct the training set. Next, the encoder of the autoencoder is used to transform the entire training set into code space. The generator network is then trained to generate code samples, which may be mapped to visually pleasing samples via the decoder.

Unlike GANs, the cost function is defined only with respect to a batch of examples from both the training set and the generator network.
It is not possible to make a training update as a function of only one training example or only one sample from the generator network. This is because the moments must be computed as an empirical average across many samples. When the batch size is too small, MMD can underestimate the true amount of variation in the distributions being sampled. No finite batch size is sufficiently large to eliminate this problem entirely, but larger batches reduce the amount of underestimation. When the batch size is too large, the training procedure becomes infeasibly slow, because many examples must be processed in order to compute a single small gradient step.

As with GANs, it is possible to train a generator net using MMD even if that generator net assigns zero probability to the training points.
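A minimal biased batch estimator of the squared MMD with a Gaussian kernel can be sketched as follows; the kernel bandwidth and the toy distributions standing in for the data and generator batches are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mmd2(x, y, gamma=0.5):
    """Biased batch estimate of squared MMD with the Gaussian kernel
    k(a, b) = exp(-gamma * ||a - b||^2), for x of shape (n, d) and
    y of shape (m, d)."""
    def gram(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq_dists)
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

p = rng.normal(0.0, 1.0, size=(500, 2))       # "training" batch
q_same = rng.normal(0.0, 1.0, size=(500, 2))  # "generator" batch, matched
q_diff = rng.normal(2.0, 1.0, size=(500, 2))  # "generator" batch, mismatched

print(mmd2(p, q_same))  # near zero: the distributions agree
print(mmd2(p, q_diff))  # clearly positive: the distributions differ
```

As the text notes, the estimate is defined over a whole batch: with too few samples the empirical averages inside the estimator become unreliable and the discrepancy is underestimated.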
20.10.6 Convolutional Generative Networks
When generating images, it is often useful to use a generator network that includes a convolutional structure (see for example Goodfellow et al. (2014c) or Dosovitskiy et al. (2015)). To do so, we use the "transpose" of the convolution operator, described in Sec. 9.5. This approach often yields more realistic images and does so using fewer parameters than using fully connected layers without parameter sharing.

Convolutional networks for recognition tasks have information flow from the image to some summarization layer at the top of the network, often a class label. As this image flows upward through the network, information is discarded as the representation of the image becomes more invariant to nuisance transformations. In a generator network, the opposite is true. Rich details must be added as
the representation of the image to be generated propagates through the network, culminating in the final representation of the image, which is of course the image itself, in all of its detailed glory, with object positions and poses and textures and
lighting. The primary mechanism for discarding information in a convolutional recognition network is the pooling layer. The generator network seems to need to add information. We cannot put the inverse of a pooling layer into the generator network because most pooling functions are not invertible. A simpler operation is to merely increase the spatial size of the representation. An approach that seems to perform acceptably is to use an "un-pooling" as introduced by Dosovitskiy et al. (2015). This layer corresponds to the inverse of the max-pooling operation under certain simplifying conditions. First, the stride of the max-pooling operation is constrained to be equal to the width of the pooling region. Second, the maximum input within each pooling region is assumed to be the input in the upper-left corner. Finally, all non-maximal inputs within each pooling region are assumed to be zero.
These are very strong and unrealistic assumptions, but they do allow the max-pooling operator to be inverted. The inverse un-pooling operation allocates a tensor of zeros, then copies each value from spatial coordinate i of the input to spatial coordinate i × k of the output. The integer value k defines the size of the pooling region. Even though the assumptions motivating the definition of the un-pooling operator are unrealistic, the subsequent layers are able to learn to compensate for its unusual output, so the samples generated by the model as a whole are visually pleasing.
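The simplified un-pooling operation described above is easy to state in code. This sketch handles a single 2-D feature map:

```python
import numpy as np

def unpool(x, k):
    """Inverse of max-pooling under the stated assumptions: allocate a
    tensor of zeros, then copy the value at spatial coordinate i of the
    input to coordinate i*k of the output (the upper-left corner of
    each k-by-k pooling region)."""
    h, w = x.shape
    out = np.zeros((h * k, w * k), dtype=x.dtype)
    out[::k, ::k] = x
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(unpool(x, 2))
# [[1. 0. 2. 0.]
#  [0. 0. 0. 0.]
#  [3. 0. 4. 0.]
#  [0. 0. 0. 0.]]
```

Max-pooling this output with stride and region width k recovers x exactly, which is the sense in which the operation inverts max-pooling under the assumptions above.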
20.10.7 Auto-Regressive Networks
Auto-regressive networks are directed probabilistic models with no latent random variables. The conditional probability distributions in these models are represented by neural networks (sometimes extremely simple neural networks such as logistic regression). The graph structure of these models is the complete graph. They decompose a joint probability over the observed variables using the chain rule of probability to obtain a product of conditionals of the form P(x_d | x_{d−1}, . . . , x_1). Such models have been called fully-visible Bayes networks (FVBNs) and used successfully in many forms, first with logistic regression for each conditional distribution (Frey, 1998) and then with neural networks with hidden units (Bengio and Bengio, 2000b; Larochelle and Murray, 2011). In some forms of auto-regressive networks, such as NADE (Larochelle and Murray, 2011), described in Sec.
20.10.10 below, we can introduce a form of parameter sharing that brings both a statistical advantage (fewer unique parameters) and a computational advantage (less computation). This is one more instance of the recurring deep learning motif of reuse of features.
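The chain-rule factorization that defines an FVBN can be sketched for the logistic case. The parameter values below are hypothetical and chosen only so the result is easy to verify:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_prob(x, weights, biases):
    """log P(x) for a logistic FVBN over binary x:
    P(x) = prod_d P(x_d | x_1, ..., x_{d-1}), with each conditional
    P(x_d = 1 | x_{<d}) = sigmoid(w_d . x_{<d} + b_d)."""
    lp = 0.0
    for d in range(len(x)):
        p_one = sigmoid(np.dot(weights[d], x[:d]) + biases[d])
        lp += np.log(p_one if x[d] == 1 else 1.0 - p_one)
    return lp

# With all-zero parameters every conditional is uniform, so any binary
# vector of length d has probability 2**(-d).
d = 4
weights = [np.zeros(i) for i in range(d)]
biases = [0.0] * d
x = np.array([1, 0, 1, 1])
print(log_prob(x, weights, biases))  # = 4 log(1/2), about -2.773
```

Because the factorization follows the chain rule, the model is normalized for any choice of parameters; no partition function is ever needed.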
Figure 20.8: A fully visible belief network predicts the i-th variable from the i − 1 previous ones. (Top) The directed graphical model for an FVBN. (Bottom) Corresponding computational graph, in the case of the logistic FVBN, where each prediction is made by a linear predictor.
20.10.8 Linear Auto-Regressive Networks
The simplest form of auto-regressive network has no hidden units and no sharing of parameters or features. Each P(x_i | x_{i−1}, . . . , x_1) is parametrized as a linear model (linear regression for real-valued data, logistic regression for binary data, softmax regression for discrete data). This model was introduced by Frey (1998) and has O(d^2) parameters when there are d variables to model. It is illustrated in Fig. 20.8.

If the variables are continuous, a linear auto-regressive model is merely another way to formulate a multivariate Gaussian distribution, capturing linear pairwise interactions between the observed variables.

Linear auto-regressive networks are essentially the generalization of linear classification methods to generative modeling.
They therefore have the same advantages and disadvantages as linear classifiers. Like linear classifiers, they may be trained with convex loss functions, and sometimes admit closed form solutions (as in the Gaussian case). Like linear classifiers, the model itself does not offer a way of increasing its capacity, so capacity must be raised using techniques like basis expansions of the input or the kernel trick.
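The equivalence between a linear-Gaussian auto-regressive model and a multivariate Gaussian noted above can be checked numerically; the coefficients here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear-Gaussian auto-regressive model over two variables:
# x1 ~ N(0, 1), then x2 | x1 ~ N(0.5 * x1, 1).
n = 200_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)

# This is just a reparametrized multivariate Gaussian whose covariance
# is [[1, 0.5], [0.5, 0.5**2 + 1]] = [[1, 0.5], [0.5, 1.25]].
print(np.cov(np.stack([x1, x2])))
```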
CHAPTER 20. DEEP GENERATIVE MODELS
Figure 20.9: A neural auto-regressive network predicts the i-th variable x_i from the i-1 previous ones, but is parametrized so that features (groups of hidden units denoted h_i) that are functions of x_1, ..., x_i can be reused in predicting all of the subsequent variables x_{i+1}, x_{i+2}, ..., x_d.
20.10.9 Neural Auto-Regressive Networks
Neural auto-regressive networks (Bengio and Bengio, 2000a,b) have the same left-to-right graphical model as logistic auto-regressive networks (Fig. 20.8) but employ a different parametrization of the conditional distributions within that graphical model structure. The new parametrization is more powerful in the sense that its capacity can be increased as much as needed, allowing approximation of any joint distribution. The new parametrization can also improve generalization by introducing a parameter sharing and feature sharing principle common to deep learning in general. The models were motivated by the objective of avoiding the curse of dimensionality arising out of traditional tabular graphical models, sharing the same structure as Fig. 20.8. In tabular discrete probabilistic models, each conditional distribution is represented by a table of probabilities, with one entry and one parameter for each possible configuration of the variables involved. By using a neural network instead, two advantages are obtained:

1. The parametrization of each P(x_i | x_{i-1}, ..., x_1) by a neural network with (i-1) × k inputs and k outputs (if the variables are discrete and take k values, encoded one-hot) allows one to estimate the conditional probability without requiring an exponential number of parameters (and examples), yet still is able to capture high-order dependencies between the random variables.

2. Instead of having a different neural network for the prediction of each x_i,
a left-to-right connectivity illustrated in Fig. 20.9 allows one to merge all the neural networks into one. Equivalently, it means that the hidden layer features computed for predicting x_i can be reused for predicting x_{i+k} (k > 0). The hidden units are thus organized in groups that have the particularity that all the units in the i-th group only depend on the input values x_1, ..., x_i. The parameters used to compute these hidden units are jointly optimized to improve the prediction of all the variables in the sequence. This is an instance of the reuse principle that recurs throughout deep learning in scenarios ranging from recurrent and convolutional network architectures to multi-task and transfer learning.

Each P(x_i | x_{i-1}, ..., x_1) can represent a conditional distribution by having outputs of the neural network predict parameters of the conditional distribution of x_i, as discussed in Sec. 6.2.1.1. Although the original neural auto-regressive networks were initially evaluated in the context of purely discrete multivariate data (with a sigmoid output for a Bernoulli variable or softmax output for a multinoulli variable) it is natural to extend such models to continuous variables or joint distributions involving both discrete and continuous variables.
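For a discrete x_i taking k values, the first advantage above can be sketched as a small network whose outputs parametrize a softmax conditional. This is a hypothetical single-hidden-layer sketch; the function and parameter names are ours, not from the cited papers:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def conditional_probs(x_prev_onehot, W1, b1, W2, b2):
    """Parameters of P(x_i | x_{i-1}, ..., x_1) for a k-valued variable.

    x_prev_onehot: one-hot encoding of x_1, ..., x_{i-1}, shape ((i-1)*k,).
    A hidden layer maps the (i-1)*k inputs to k softmax outputs, so the
    parameter count grows linearly in i and k rather than as k^i."""
    h = np.tanh(W1 @ x_prev_onehot + b1)
    return softmax(W2 @ h + b2)
```

For example, with k = 3 and i = 3 the input has dimension 6, and the returned vector is a valid distribution over the 3 possible values of x_3.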
20.10.10 NADE
The neural autoregressive density estimator (NADE) is a very successful recent form of neural auto-regressive network (Larochelle and Murray, 2011). The connectivity is the same as for the original neural auto-regressive network of Bengio and Bengio (2000b) but NADE introduces an additional parameter sharing scheme, as illustrated in Fig. 20.10. The parameters of the hidden units of different groups j are shared.

The weights W'_{j,k,i} from the i-th input x_i to the k-th element of the j-th group of hidden units h_k^{(j)} (j >= i) are shared among the groups:

W'_{j,k,i} = W_{k,i}.   (20.84)

The remaining weights, where j < i, are zero.

Figure 20.10: An illustration of the neural autoregressive density estimator (NADE). The hidden units are organized in groups h^{(j)} so that only the inputs x_1, ..., x_i participate in computing h^{(i)} and predicting P(x_j | x_{j-1}, ..., x_1), for j > i. NADE is differentiated from earlier neural auto-regressive networks by the use of a particular weight sharing pattern: W'_{j,k,i} = W_{k,i} is shared (indicated in the figure by the use of the same line pattern for every instance of a replicated weight) for all the weights going out from x_i to the k-th unit of any group j >= i. Recall that the vector (W_{1,i}, W_{2,i}, ..., W_{n,i}) is denoted W_{:,i}.

Larochelle and Murray (2011) chose this sharing scheme so that forward propagation in a NADE model loosely resembles the computations performed in mean field inference to fill in missing inputs in an RBM. This mean field inference corresponds to running a recurrent network with shared weights and the first step of that inference is the same as in NADE. The only difference is that with NADE, the output weights connecting the hidden units to the output are parametrized independently from the weights connecting the input units to the hidden units. In the RBM, the hidden-to-output weights are the transpose of the input-to-hidden weights. The NADE architecture can be extended to mimic not just one time step of the mean field recurrent inference but to mimic k steps. This approach is called NADE-k (Raiko et al., 2014).

As mentioned previously, auto-regressive networks may be extended to process continuous-valued data. A particularly powerful and generic way of parametrizing a continuous density is as a Gaussian mixture (introduced in Sec. 3.9.6) with mixture weights α_i (the coefficient or prior probability for component i), per-component conditional mean µ_i and per-component conditional variance σ_i^2. A model called RNADE (Uria et al., 2013) uses this parametrization to extend NADE to real values. As with other mixture density networks, the parameters of this distribution are outputs of the network, with the mixture weight probabilities produced by a softmax unit, and the variances parametrized so that they are positive. Stochastic gradient descent can be numerically ill-behaved due to the interactions between the conditional means µ_i and the conditional variances σ_i^2. To reduce this difficulty, Uria et al. (2013) use a pseudo-gradient that replaces the gradient on the mean, in the back-propagation phase.
Another very interesting extension of the neural auto-regressive architectures gets rid of the need to choose an arbitrary order for the observed variables (Murray and Larochelle, 2014). In auto-regressive networks, the idea is to train the network to be able to cope with any order by randomly sampling orders and providing the information to hidden units specifying which of the inputs are observed (on the right side of the conditioning bar) and which are to be predicted and are thus considered missing (on the left side of the conditioning bar). This is nice because it allows one to use a trained auto-regressive network to perform any inference problem (i.e. predict or sample from the probability distribution over any subset of variables given any subset) extremely efficiently. Finally, since many orders of variables are possible (n! for n variables) and each order o of variables yields a different p(x | o), we can form an ensemble of models for many values of o:

p_ensemble(x) = (1/k) Σ_{i=1}^{k} p(x | o^{(i)}).   (20.85)

This ensemble model usually generalizes better and assigns higher probability to the test set than does an individual model defined by a single ordering.

In the same paper, the authors propose deep versions of the architecture, but unfortunately that immediately makes computation as expensive as in the original neural auto-regressive neural network (Bengio and Bengio, 2000b). The first layer and the output layer can still be computed in O(nh) multiply-add operations, as in the regular NADE, where h is the number of hidden units (the size of the groups h_i, in Fig. 20.10 and Fig. 20.9), whereas it is O(n^2 h) in Bengio and Bengio (2000b). However, for the other hidden layers, the computation is O(n^2 h^2) if every "previous" group at layer l participates in predicting the "next" group at layer l + 1, assuming n groups of h hidden units at each layer. Making the i-th group at layer l + 1 only depend on the i-th group, as in Murray and Larochelle (2014) at layer l, reduces it to O(nh^2), which is still h times worse than the regular NADE.
20.11 Drawing Samples from Autoencoders
In Chapter 14, we saw that many kinds of autoencoders learn the data distribution. There are close connections between score matching, denoising autoencoders, and contractive autoencoders. These connections demonstrate that some kinds of autoencoders learn the data distribution in some way. We have not yet seen how to draw samples from such models.

Some kinds of autoencoders, such as the variational autoencoder, explicitly
represent a probability distribution and admit straightforward ancestral sampling. Most other kinds of autoencoders require MCMC sampling.

Contractive autoencoders are designed to recover an estimate of the tangent plane of the data manifold. This means that repeated encoding and decoding with injected noise will induce a random walk along the surface of the manifold (Rifai et al., 2012; Mesnil et al., 2012). This manifold diffusion technique is a kind of Markov chain.

There is also a more general Markov chain that can sample from any denoising autoencoder.
20.11.1 Markov Chain Associated with any Denoising Autoencoder

The above discussion left open the question of what noise to inject and where, in
order to obtain a Markov chain that would generate from the distribution estimated by the autoencoder. Bengio et al. (2013c) showed how to construct such a Markov chain for generalized denoising autoencoders. Generalized denoising autoencoders are specified by a denoising distribution for sampling an estimate of the clean input given the corrupted input.

Each step of the Markov chain that generates from the estimated distribution consists of the following sub-steps, illustrated in Fig. 20.11:

1. Starting from the previous state x, inject corruption noise, sampling x̃ from C(x̃ | x).

2. Encode x̃ into h = f(x̃).

3. Decode h to obtain the parameters ω = g(h) of p(x | ω = g(h)) = p(x | x̃).

4. Sample the next state x from p(x | ω = g(h)) = p(x | x̃).

Bengio et al. (2014) showed that if the autoencoder p(x | x̃) forms a consistent estimator of the corresponding true conditional distribution, then the stationary distribution of the above Markov chain forms a consistent estimator (albeit an implicit one) of the data generating distribution of x.
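The four sub-steps above can be sketched directly. This is a hypothetical sketch: `encode` and `decode` stand in for a trained f and g, and both the corruption C and the reconstruction distribution are taken to be Gaussian, matching the squared-error case discussed around Fig. 20.11:

```python
import numpy as np

def dae_chain_samples(x0, encode, decode, corruption_std, recon_std,
                      n_steps, rng=None):
    """Run the Markov chain of a trained denoising autoencoder.

    Each iteration performs the four sub-steps from the text: corrupt,
    encode, decode to parameters omega, then sample the next state from
    the reconstruction distribution."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        x_tilde = x + corruption_std * rng.standard_normal(x.shape)  # 1. C(x~ | x)
        h = encode(x_tilde)                                          # 2. h = f(x~)
        omega = decode(h)                                            # 3. omega = g(h)
        x = omega + recon_std * rng.standard_normal(x.shape)         # 4. x ~ p(x | omega)
        samples.append(x.copy())
    return samples
```

With a well-trained autoencoder, step 3 pulls corrupted points back toward the data manifold, so the chain wanders near high-density regions rather than diffusing away.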
20.11.2 Clamping and Conditional Sampling
Similarly to Boltzmann machines, denoising autoencoders and their generalizations (such as GSNs, described below) can be used to sample from a conditional distribution p(x_f | x_o), simply by clamping the observed units x_o and only resampling
Figure 20.11: Each step of the Markov chain associated with a trained denoising autoencoder, that generates the samples from the probabilistic model implicitly trained by the denoising log-likelihood criterion. Each step consists in (a) injecting noise via corruption process C in state x, yielding x̃, (b) encoding it with function f, yielding h = f(x̃), (c) decoding the result with function g, yielding parameters ω for the reconstruction distribution, and (d) given ω, sampling a new state from the reconstruction distribution p(x | ω = g(f(x̃))). In the typical squared reconstruction error case, g(h) = x̂, which estimates E[x | x̃], corruption consists in adding Gaussian noise, and sampling from p(x | ω) consists in adding Gaussian noise, a second time, to the reconstruction x̂. The latter noise level should correspond to the mean squared error of reconstructions, whereas the injected noise is a hyperparameter that controls the mixing speed as well as the extent to which the estimator smooths the empirical distribution (Vincent, 2011). In the example illustrated here, only the C and p conditionals are stochastic steps (f and g are deterministic computations), although noise can also be injected inside the autoencoder, as in generative stochastic networks (Bengio et al., 2014).
the free units x_f given x_o and the sampled latent variables (if any). For example, MP-DBMs can be interpreted as a form of denoising autoencoder, and are able to sample missing inputs. GSNs later generalized some of the ideas present in MP-DBMs to perform the same operation (Bengio et al., 2014). Alain et al. (2015) identified a missing condition from Proposition 1 of Bengio et al. (2014), which is that the transition operator (defined by the stochastic mapping going from one state of the chain to the next) should satisfy a property called detailed balance, which specifies that a Markov chain at equilibrium will remain in equilibrium whether the transition operator is run in forward or reverse.

An experiment in clamping half of the pixels (the right part of the image) and running the Markov chain on the other half is shown in Fig. 20.12.
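Clamping requires only a small change to the chain: after each transition, the observed units are reset to their clamped values so that only the free units are resampled. A sketch follows; `transition` stands in for one full step of a trained denoising-autoencoder or GSN chain, and the names are ours:

```python
import numpy as np

def clamped_step(x, observed_mask, transition):
    """One conditional-sampling step: run the chain's transition, then
    restore the clamped observed units x_o so that only x_f is resampled.

    observed_mask: boolean array, True where units are observed."""
    x_new = transition(x)
    x_new[observed_mask] = x[observed_mask]   # clamp: keep x_o fixed
    return x_new
```

Iterating this step yields (approximate) samples from p(x_f | x_o) once the chain mixes.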
Figure 20.12: Illustration of clamping the right half of the image and running the Markov chain by resampling only the left half at each step. These samples come from a GSN trained to reconstruct MNIST digits at each time step using the walkback procedure.
20.11.3 Walk-Back Training Procedure

The walk-back training procedure was proposed by Bengio et al. (2013c) as a way to accelerate the convergence of generative training of denoising autoencoders. Instead of performing a one-step encode-decode reconstruction, this procedure consists in alternating multiple stochastic encode-decode steps (as in the generative
Markov chain) initialized at a training example (just like with the contrastive divergence algorithm, described in Sec. 18.2) and penalizing the last probabilistic reconstructions (or all of the reconstructions along the way).

Training with k steps is equivalent (in the sense of achieving the same stationary distribution) to training with one step, but practically has the advantage that spurious modes farther from the data can be removed more efficiently.
20.12 Generative Stochastic Networks
Generative stochastic networks or GSNs (Bengio et al., 2014) are generalizations of denoising autoencoders that include latent variables h in the generative Markov chain, in addition to the visible variables (usually denoted x).

A GSN is parametrized by two conditional probability distributions which specify one step of the Markov chain:

1. p(x^{(k)} | h^{(k)}) tells how to generate the next visible variable given the current latent state. Such a "reconstruction distribution" is also found in denoising autoencoders, RBMs, DBNs and DBMs.

2. p(h^{(k)} | h^{(k-1)}, x^{(k-1)}) tells how to update the latent state variable, given the previous latent state and visible variable.
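One step of the chain applies the two conditionals in turn: first update the latent state, then emit a visible variable. A hypothetical sketch, where `latent_update` and `reconstruct` stand in for trained networks and the added Gaussian noise is what makes both conditionals stochastic:

```python
import numpy as np

def gsn_step(h, x, latent_update, reconstruct, noise_std=0.1, rng=None):
    """One GSN transition.

    Draws h^(k) ~ p(h | h^(k-1), x^(k-1)), then x^(k) ~ p(x | h^(k))."""
    rng = rng or np.random.default_rng(0)
    h_next = latent_update(h, x) + noise_std * rng.standard_normal(h.shape)
    x_next = reconstruct(h_next) + noise_std * rng.standard_normal(x.shape)
    return h_next, x_next
```

Iterating `gsn_step` produces the generative Markov chain over (h, x) pairs described in the text.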
Denoising autoencoders and GSNs differ from classical probabilistic models (directed or undirected) in that they parametrize the generative process itself rather than the mathematical specification of the joint distribution of visible and latent variables. Instead, the latter is defined implicitly, if it exists, as the stationary distribution of the generative Markov chain. The conditions for existence of the stationary distribution are mild and are the same conditions required by standard MCMC methods (see Sec. 17.3). These conditions are necessary to guarantee that the chain mixes, but they can be violated by some choices of the transition distributions (for example, if they were deterministic).

One could imagine different training criteria for GSNs. The one proposed and evaluated by Bengio et al. (2014) is simply reconstruction log-probability on the visible units, just like for denoising autoencoders. This is achieved by clamping x^(0) = x to the observed example and maximizing the probability of generating x at some subsequent time steps, i.e., maximizing log p(x^(k) = x | h^(k)), where h^(k) is sampled from the chain, given x^(0) = x. In order to estimate the gradient of log p(x^(k) = x | h^(k)) with respect to the other pieces of the model, Bengio et al. (2014) use the reparametrization trick, introduced in Sec. 20.9.
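A sketch of this clamped-chain objective, again with toy Gaussian conditionals. The update rule, the single scalar parameter w, and the fixed noise scale are assumptions made for illustration; in a real GSN, gradients would flow through the reparametrized noise as in Sec. 20.9:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_logpdf(x, mean, sigma):
    # log N(x; mean, sigma^2 I), summed over dimensions
    return float(np.sum(-0.5 * ((x - mean) / sigma) ** 2
                        - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)))

def clamped_chain_objective(x, w, k=3, sigma=0.1):
    """Clamp x^(0) = x, run k chain steps with reparametrized noise, and
    accumulate log p(x^(k) = x | h^(k)): the chain should regenerate x."""
    h, x_cur, total = np.zeros_like(x), x, 0.0
    for _ in range(k):
        eps = 0.1 * rng.standard_normal(h.shape)  # reparametrized noise
        h = np.tanh(w * (h + x_cur)) + eps        # toy latent update
        mean = w * h                              # toy reconstruction mean
        total += gaussian_logpdf(x, mean, sigma)  # reward regenerating x
        x_cur = mean + sigma * rng.standard_normal(x.shape)
    return total

obj = clamped_chain_objective(np.array([0.3, -0.2]), w=1.0)
# training would adjust w (in a real GSN, the network weights) to increase obj
```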
CHAPTER 20. DEEP GENERATIVE MODELS
The walk-back training protocol (described in Sec. 20.11.3) was used (Bengio et al., 2014) to improve training convergence of GSNs.
20.12.1
Discriminant GSNs

The original formulation of GSNs (Bengio et al., 2014) was meant for unsupervised learning and implicitly modeling p(x) for observed data x, but it is possible to modify the framework to optimize p(y | x).

For example, Zhou and Troyanskaya (2014) generalize GSNs in this way, by only back-propagating the reconstruction log-probability over the output variables, keeping the input variables fixed. They applied this successfully to model sequences (protein secondary structure) and introduced a (one-dimensional) convolutional structure in the transition operator of the Markov chain. It is important to remember that, for each step of the Markov chain, one generates a new sequence for each layer, and that sequence is the input for computing other layer values (say the one below and the one above) at the next time step.

Hence the Markov chain is really over the output variable (and the associated higher-level hidden layers), and the input sequence only serves to condition that chain, with back-propagation allowing the model to learn how the input sequence can condition the output distribution implicitly represented by the Markov chain. It is therefore a case of using the GSN in the context of structured outputs, where p(y | x) does not have a simple parametric form but instead the components of y are statistically dependent on each other, given x, in complicated ways.

Zöhrer and Pernkopf (2014) introduced a hybrid model that combines a supervised objective (as in the above work) and an unsupervised objective (as in the original GSN work), by simply adding (with a different weight) the supervised and unsupervised costs, i.e., the reconstruction log-probabilities of y and x respectively.
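The hybrid criterion is just a weighted sum of the two reconstruction log-probabilities; a one-line sketch (the weight value and the example log-probabilities are arbitrary, hypothetical numbers):

```python
def hybrid_cost(log_p_y, log_p_x, unsup_weight=0.5):
    """Negative hybrid objective: supervised reconstruction log-probability
    of y plus a weighted unsupervised reconstruction term for x."""
    return -(log_p_y + unsup_weight * log_p_x)

# hypothetical per-example log-probabilities from the two reconstruction terms
cost = hybrid_cost(log_p_y=-1.2, log_p_x=-4.0)  # = -(-1.2 + 0.5 * -4.0) = 3.2
```

The weight trades off classification accuracy against how much of the input distribution the model is forced to capture.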
Such a hybrid criterion had previously been introduced for RBMs by Larochelle and Bengio (2008). They show improved classification performance using this scheme.
20.13
Other Generation Schemes

The methods we have described so far use either MCMC sampling, ancestral sampling, or some mixture of the two to generate samples. While these are the most popular approaches to generative modeling, they are by no means the only approaches.
Sohl-Dickstein et al. (2015) developed a diffusion inversion training scheme for learning a generative model, based on non-equilibrium thermodynamics. The approach is based on the idea that the probability distributions we wish to sample from have structure. This structure can gradually be destroyed by a diffusion process that incrementally changes the probability distribution to have more entropy. To form a generative model, we can run the process in reverse, by training a model that gradually restores the structure to an unstructured distribution. By iteratively applying a process that brings a distribution closer to the target one, we can gradually approach that target distribution. This approach resembles MCMC methods in the sense that it involves many iterations to produce a sample. However, the model is defined to be the probability distribution produced by the final step of the chain.
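A minimal sketch of the forward (structure-destroying) half of this idea, using a Gaussian diffusion that shrinks the signal and injects noise at each step; the step count and noise level are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def diffuse(x0, n_steps=200, beta=0.05):
    """Each step mixes in Gaussian noise, incrementally raising entropy
    until the data distribution is destroyed (it approaches N(0, 1))."""
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
        xs.append(x)
    return xs

# highly structured "data": a tight cluster far from the origin
x0 = 2.0 + 0.01 * rng.standard_normal(1000)
xs = diffuse(x0)
# a generative model would be trained to undo each of these small steps,
# then run in reverse starting from pure noise
```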
In this sense, there is no approximation induced by the iterative procedure. The approach introduced by Sohl-Dickstein et al. (2015) is also very close to the generative interpretation of the denoising autoencoder (Sec. 20.11.1). Like with the denoising autoencoder, the training objective trains a transition operator which attempts to probabilistically undo the effect of adding some noise, trying to undo one step of the diffusion process. If we compare with the walk-back training procedure (Sec. 20.11.3) for denoising autoencoders and GSNs, the main difference is that instead of reconstructing only towards the observed training point x, the objective function only tries to reconstruct towards the previous point in the diffusion trajectory that started at x (which should be easier). This addresses the following dilemma present with the ordinary reconstruction log-likelihood objective of denoising autoencoders: with small levels of noise the learner only sees configurations near the data points, while with large levels of noise it is asked to do an almost impossible job (because the denoising distribution is going to be highly complex and multi-modal). With the diffusion inversion objective, the learner can learn more precisely the shape of the density around the data points as well as remove spurious modes that could show up far from the data points.

Another approach to sample generation is the approximate Bayesian computation (ABC) framework (Rubin et al., 1984). In this approach, samples are rejected or modified in order to make the moments of selected functions of the samples match those of the desired distribution.
thisapproach, idea uses samples the moments of the or modified in order to make the moments of selected functions of the samples samples lik likee in moment matc matching, hing, it is different from moment matc matching hing because it matc h those of the desired distribution. While this idea uses the moments of the mo modifies difies the samples themselves, rather than training the model to automatically samples like inwith moment matching, it ists.different from moment hing because it emit samples the correct momen moments. Bac Bachman hman and Precupmatc (2015 ) show showed ed ho how w mouse difies thefrom samples rather than learning, training the modelABC to automatically to ideas ABCthemselves, in the con context text of deep by using to shap shapee the emit samples with the correct momen ts. Bac hman and Precup ( 2015 ) show ed how MCMC tra trajectories jectories of GSNs. to use ideas from ABC in the context of deep learning, by using ABC to shape the We exp expect that many other possible approaches to generativ generativee modeling await MCMC traect jectories of GSNs. disco discov very ery.. We expect that many other possible approaches to generative modeling await 718 discovery.
20.14
Evaluating Generative Models

Researchers studying generative models often need to compare one generative model to another, usually in order to demonstrate that a newly invented generative model is better at capturing some distribution than the pre-existing models.

This can be a difficult and subtle task. In many cases, we cannot actually evaluate the log probability of the data under the model, but only an approximation. In these cases, it is important to think and communicate clearly about exactly what is being measured. For example, suppose we can evaluate a stochastic estimate of the log-likelihood for model A, and a deterministic lower bound on the log-likelihood for model B. If model A gets a higher score than model B, which is better? If we care about determining which model has a better internal representation of the distribution, we actually cannot tell, unless we have some way of determining how loose the bound for model B is. However, if we care about how well we can use the model in practice, for example to perform anomaly detection, then it is fair to say that a model is preferable based on a criterion specific to the practical task of interest, e.g., based on ranking test examples and ranking criteria such as precision and recall.

Another subtlety of evaluating generative models is that the evaluation metrics are often hard research problems in and of themselves. It can be very difficult to establish that models are being compared fairly. For example, suppose we use AIS to estimate log Z in order to compute log p̃(x) − log Z for a new model we have just invented.
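The way a partition-function estimate that misses modes inflates the reported likelihood can be reproduced with plain importance sampling, used here as a crude stand-in for AIS; the bimodal target and the single-mode proposal are contrived for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_tilde(x):
    # unnormalized density with two well-separated equal modes;
    # the true partition function is Z = 2 * sqrt(2 * pi)
    return np.exp(-0.5 * (x + 4.0) ** 2) + np.exp(-0.5 * (x - 4.0) ** 2)

def estimate_log_Z(n=100_000):
    """Importance-sampling estimate of log Z with a proposal N(-4, 1)
    that effectively misses the mode at +4."""
    x = rng.normal(-4.0, 1.0, size=n)
    q = np.exp(-0.5 * (x + 4.0) ** 2) / np.sqrt(2.0 * np.pi)
    return float(np.log(np.mean(p_tilde(x) / q)))

true_log_Z = float(np.log(2.0 * np.sqrt(2.0 * np.pi)))
est_log_Z = estimate_log_Z()
# est_log_Z falls short of true_log_Z by about log 2, so a reported
# log p(x) = log p_tilde(x) - est_log_Z would be inflated by that amount
```

The estimator is unbiased in expectation, but with any feasible number of samples the missed mode contributes essentially nothing, so the realized estimate is too small.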
A computationally economical implementation of AIS may fail to find several modes of the model distribution and underestimate Z, which will result in us overestimating log p(x). It can thus be difficult to tell whether a high likelihood estimate is due to a good model or a bad AIS implementation.

Other fields of machine learning usually allow for some variation in the preprocessing of the data. For example, when comparing the accuracy of object recognition algorithms, it is usually acceptable to preprocess the input images slightly differently for each algorithm based on what kind of input requirements it has. Generative modeling is different because changes in preprocessing, even very small and subtle ones, are completely unacceptable. Any change to the input data changes the distribution to be captured and fundamentally alters the task. For example, multiplying the input by 0.1 will artificially increase likelihood by a factor of 10.

Issues with preprocessing commonly arise when benchmarking generative models on the MNIST dataset, one of the more popular generative modeling benchmarks. MNIST consists of grayscale images. Some models treat MNIST images as points
in a real vector space, while others treat them as binary. Yet others treat the grayscale values as probabilities for binary samples. It is essential to compare real-valued models only to other real-valued models and binary-valued models only to other binary-valued models. Otherwise the likelihoods measured are not on the same space. For binary-valued models, the log-likelihood can be at most zero, while for real-valued models it can be arbitrarily high, since it is the measurement of a density. Among binary models, it is important to compare models using exactly the same kind of binarization. For example, we might binarize a gray pixel to 0 or 1 by thresholding at 0.5, or by drawing a random sample whose probability of being 1 is given by the gray pixel intensity. If we use the random binarization, we might binarize the whole dataset once, or we might draw a different random example for each step of training and then draw multiple samples for evaluation. Each of these three schemes yields wildly different likelihood numbers, and when comparing different models it is important that both models use the same binarization scheme for training and for evaluation. In fact, researchers who apply a single random binarization step share a file containing the results of the random binarization, so that there is no difference in results based on different outcomes of the binarization step.

Because being able to generate realistic samples from the data distribution is one of the goals of a generative model, practitioners often evaluate generative models by visually inspecting the samples. In the best case, this is done not by the researchers themselves, but by experimental subjects who do not know the source of the samples (Denton et al., 2015). Unfortunately, it is possible for a very poor probabilistic model to produce very good samples. A common practice to verify if the model only copies some of the training examples is illustrated in Fig. 16.1. The idea is to show for some of the generated samples their nearest neighbor in the training set, according to Euclidean distance in the space of x. This test is intended to detect the case where the model overfits the training set and just reproduces training instances. It is even possible to simultaneously underfit and overfit yet still produce samples that individually look good. Imagine a generative model trained on images of dogs and cats that simply learns to reproduce the training images of dogs. Such a model has clearly overfit, because it does not produce images that were not in the training set, but it has also underfit, because it assigns no probability to the training images of cats. Yet a human observer would judge each individual image of a dog to be high quality. In this simple example, it would be easy for a human observer who can inspect many samples to determine that the cats are absent. In more realistic settings, a generative model trained on data with tens of thousands of modes may ignore a small number of modes, and a human observer would not easily be able to inspect or remember
enough images to detect the missing variation.

Since the visual quality of samples is not a reliable guide, we often also evaluate the log-likelihood that the model assigns to the test data, when this is computationally feasible. Unfortunately, in some cases the likelihood seems not to measure any attribute of the model that we really care about. For example, real-valued models of MNIST can obtain arbitrarily high likelihood by assigning arbitrarily low variance to background pixels that never change. Models and algorithms that detect these constant features can reap unlimited rewards, even though this is not a very useful thing to do. The potential to achieve a cost approaching negative infinity is present for any kind of maximum likelihood problem with real values, but it is especially problematic for generative models of MNIST because so many of the output values are trivial to predict. This strongly suggests a need for developing other ways of evaluating generative models.

Theis et al. (2015) review many of the issues involved in evaluating generative models, including many of the ideas described above. They highlight the fact that there are many different uses of generative models and that the choice of metric must match the intended use of the model. For example, some generative models are better at assigning high probability to most realistic points while other generative models are better at rarely assigning high probability to unrealistic points. These differences can result from whether a generative model is designed to minimize D_KL(p_data ‖ p_model) or D_KL(p_model ‖ p_data), as illustrated in Fig. 3.6. Unfortunately, even when we restrict the use of each metric to the task it is most
suited for, all of the metrics currently in use continue to have serious weaknesses. One of the most important research topics in generative modeling is therefore not just how to improve generative models, but in fact designing new techniques to measure our progress.
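Two of the pathologies discussed above can be reproduced numerically. The sketch below is illustrative only: the pixel value and the discrete distributions are made-up toy examples, not drawn from the book. The first part shows how a Gaussian model of a constant pixel drives log-likelihood toward infinity as its variance shrinks; the second shows that the two KL directions prefer different models of a two-mode "data" distribution.

```python
import math

# --- Pathology 1: unbounded likelihood from constant pixels ----------------
# A real-valued model of a background pixel that is always exactly 0 can
# center a Gaussian on it (mu = 0) and shrink sigma, driving the
# log-likelihood of that pixel toward +infinity.

def gaussian_log_density(x, mu, sigma):
    """Log density of a univariate Gaussian N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) \
           - 0.5 * ((x - mu) / sigma) ** 2

for sigma in (1.0, 1e-2, 1e-4, 1e-8):
    print(f"sigma={sigma:g}: log p(0) = {gaussian_log_density(0.0, 0.0, sigma):.2f}")

# --- Pathology 2: the two KL directions prefer different models ------------

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_data   = [0.48, 0.02, 0.02, 0.48]   # made-up two-mode "data" distribution
covering = [0.25, 0.25, 0.25, 0.25]   # mode-covering model: spreads mass
seeking  = [0.94, 0.04, 0.01, 0.01]   # mode-seeking model: commits to one mode

# Forward KL, D_KL(p_data || p_model), is the maximum likelihood direction;
# it prefers the covering model, which never leaves a data mode nearly empty.
print(kl(p_data, covering), "<", kl(p_data, seeking))

# Reverse KL, D_KL(p_model || p_data), prefers the seeking model;
# it penalizes model mass placed on outcomes the data finds unlikely.
print(kl(seeking, p_data), "<", kl(covering, p_data))
```

The forward direction must place model probability wherever the data has mass, so it covers both modes; the reverse direction must avoid placing model probability where the data has little, so it collapses onto one mode.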
20.15 Conclusion
Training generative models with hidden units is a powerful way to make models understand the world represented in the given training data. By learning a model p_model(x) and a representation p_model(h | x), a generative model can provide answers to many inference problems about the relationships between input variables in x, and can provide many different ways of representing x by taking expectations of h at different layers of the hierarchy. Generative models hold the promise to provide AI systems with a framework for all of the many different intuitive concepts they need to understand, and the ability to reason about these concepts in the face of uncertainty. We hope that our readers will find new ways to make these
CHAPTER 20. DEEP GENERATIVE MODELS
approaches more powerful and continue the journey to understanding the principles that underlie learning and intelligence.
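The closing point — that a model p_model(x) together with p_model(h | x) supports many inference queries — can be made concrete with a deliberately tiny, hypothetical two-variable model; all of the numbers below are invented for illustration:

```python
# Hypothetical tiny generative model: binary latent h and binary observed x.
# p(h = 1) = 0.3; p(x = 1 | h) = 0.9 if h == 1 else 0.2.
p_h = {0: 0.7, 1: 0.3}
p_x_given_h = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}

def p_x(x):
    """p_model(x): marginalize out the latent variable h."""
    return sum(p_h[h] * p_x_given_h[h][x] for h in (0, 1))

def p_h_given_x(h, x):
    """p_model(h | x): Bayes' rule gives the representation of x."""
    return p_h[h] * p_x_given_h[h][x] / p_x(x)

# One simple "representation" of an observation: the expectation of h
# given x, analogous to taking expectations at a layer of the hierarchy.
for x in (0, 1):
    e_h = sum(h * p_h_given_x(h, x) for h in (0, 1))
    print(f"p(x={x}) = {p_x(x):.3f}, E[h | x={x}] = {e_h:.3f}")
```

Even in this toy setting, the same learned quantities answer both a density query (how probable is this observation?) and a representation query (what latent state likely produced it?), which is the pattern the chapter's models scale up.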
Bibliography
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. 25, 212, 449

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169. 573, 656

Alain, G. and Bengio, Y. (2013). What regularized auto-encoders learn from the data generating distribution. In ICLR'2013, arXiv:1211.4246. 510, 516, 524

Alain, G., Bengio, Y., Yao, L., Éric Thibodeau-Laufer, Yosinski, J., and Vincent, P. (2015). GSNs: Generative stochastic networks. arXiv:1503.05571. 513, 715

Anderson, E. (1935). The Irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59, 2–5. 21

Ba, J., Mnih, V., and Kavukcuoglu, K. (2014). Multiple object recognition with visual attention. arXiv:1412.7755. 693

Bachman, P. and Precup, D. (2015). Variational generative stochastic networks with collaborative shaping. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1964–1972. 718

Bacon, P.-L., Bengio, E., Pineau, J., and Precup, D. (2015). Conditional computation in neural networks using a decision-theoretic approach. In 2nd Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM 2015). 453

Bagnell, J. A. and Bradley, D. M. (2009). Differentiable sparse coding. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21 (NIPS'08), pages 113–120. 501
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In ICLR'2015, arXiv:1409.0473. 25, 101, 399, 421, 422, 468, 478

Bahl, L. R., Brown, P., de Souza, P. V., and Mercer, R. L. (1987). Speech recognition with continuous-parameter hidden Markov models. Computer, Speech and Language, 2, 219–234. 461

Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53–58. 286

Baldi, P., Brunak, S., Frasconi, P., Soda, G., and Pollastri, G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15(11), 937–946. 396

Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5. 26

Ballard, D. H., Hinton, G. E., and Sejnowski, T. J. (1983). Parallel vision computation. Nature. 455

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311. 146

Barron, A. E. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. on Information Theory, 39, 930–945. 198

Bartholomew, D. J. (1987). Latent variable models and factor analysis. Oxford University Press. 493

Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and Applications. Wiley. 493

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop. 25, 82, 212, 222, 449

Basu, S. and Christensen, J. (2013). Teaching classification boundaries to humans. In AAAI'2013. 329

Baxter, J. (1995). Learning internal representations. In Proceedings of the 8th International Conference on Computational Learning Theory (COLT'95), pages 311–320, Santa Cruz, California. ACM Press. 246

Bayer, J. and Osendorfer, C. (2014). Learning stochastic recurrent networks. ArXiv e-prints. 264

Becker, S. and Hinton, G. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355, 161–163. 544
Behnke, S. (2001). Learning iterative image reconstruction in the neural abstraction pyramid. Int. J. Computational Intelligence and Applications, 1(4), 427–438. 518

Beiu, V., Quintana, J. M., and Avedillo, M. J. (2003). VLSI implementations of threshold logic-a comprehensive survey. Neural Networks, IEEE Transactions on, 14(5), 1217–1243. 454

Belkin, M. and Niyogi, P. (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS'01), Cambridge, MA. MIT Press. 244

Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396. 163, 521

Bengio, E., Bacon, P.-L., Pineau, J., and Precup, D. (2015a). Conditional computation in neural networks for faster models. arXiv:1511.06297. 453

Bengio, S. and Bengio, Y. (2000a). Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks, special issue on Data Mining and Knowledge Discovery, 11(3), 550–557. 709

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015b). Scheduled sampling for sequence prediction with recurrent neural networks. Technical report, arXiv:1506.03099. 385

Bengio, Y. (1991). Artificial Neural Networks and their Application to Sequence Recognition. Ph.D. thesis, McGill University, (Computer Science), Montreal, Canada. 409

Bengio, Y. (2000). Gradient-based optimization of hyperparameters. Neural Computation, 12(8), 1889–1900. 438

Bengio, Y. (2002). New distributed probabilistic language models. Technical Report 1215, Dept. IRO, Université de Montréal. 470

Bengio, Y. (2009). Learning deep architectures for AI. Now Publishers. 200, 626

Bengio, Y. (2013). Deep learning of representations: looking forward. In Statistical Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science, pages 1–37. Springer, also in arXiv at http://arxiv.org/abs/1305.0445. 451

Bengio, Y. (2015). Early inference in energy-based models approximates back-propagation. Technical Report arXiv:1510.02777, Universite de Montreal. 658

Bengio, Y. and Bengio, S. (2000b). Modeling high-dimensional discrete data with multi-layer neural networks. In NIPS 12, pages 400–406. MIT Press. 707, 709, 710, 712

Bengio, Y. and Delalleau, O. (2009). Justifying and generalizing contrastive divergence. Neural Computation, 21(6), 1601–1621. 516, 614
Bengio, Y. and Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16 (NIPS'03), Cambridge, MA. MIT Press, Cambridge. 122

Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In Large Scale Kernel Machines. 19

Bengio, Y. and Monperrus, M. (2005). Non-local manifold tangent learning. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS'04), pages 129–136. MIT Press. 159, 522

Bengio, Y. and Sénécal, J.-S. (2003). Quick training of probabilistic neural nets by importance sampling. In Proceedings of AISTATS 2003. 473

Bengio, Y. and Sénécal, J.-S. (2008). Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Trans. Neural Networks, 19(4), 713–722. 473

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1991). Phonetically motivated acoustic parameters for continuous speech recognition using artificial neural networks. In Proceedings of EuroSpeech'91. 27, 462

Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Neural network-Gaussian mixture hybrid for speech recognition or density estimation. In NIPS 4, pages 175–182. Morgan Kaufmann. 462

Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. In IEEE International Conference on Neural Networks, pages 1183–1195, San Francisco. IEEE Press. (invited paper). 405

Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Tr. Neural Nets. 18, 403, 405, 406, 414

Bengio, Y., Latendresse, S., and Dugas, C. (1999). Gradient-based learning of hyperparameters. Learning Conference, Snowbird. 438

Bengio, Y., Ducharme, R., and Vincent, P. (2001). A neural probabilistic language model. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, NIPS'2000, pages 932–938. MIT Press. 18, 450, 466, 469, 475, 480, 485

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. JMLR, 3, 1137–1155. 469, 475

Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., and Marcotte, P. (2006a). Convex neural networks. In NIPS'2005, pages 123–130. 257

Bengio, Y., Delalleau, O., and Le Roux, N. (2006b). The curse of highly variable functions for local kernel machines. In NIPS'2005. 157
Bengio, Y., Larochelle, H., and Vincent, P. (2006c). Non-local manifold Parzen windows. In NIPS'2005. MIT Press. 159, 523

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS'2006. 14, 19, 200, 324, 325, 531, 533

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In ICML'09. 329

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013a). Better mixing via deep representations. In ICML'2013. 607

Bengio, Y., Léonard, N., and Courville, A. (2013b). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432. 451, 453, 691, 693

Bengio, Y., Yao, L., Alain, G., and Vincent, P. (2013c). Generalized denoising auto-encoders as generative models. In NIPS'2013. 510, 713, 715

Bengio, Y., Courville, A., and Vincent, P. (2013d). Representation learning: A review and new perspectives. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 35(8), 1798–1828. 558

Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014). Deep generative stochastic networks trainable by backprop. In ICML'2014. 713, 714, 715, 716, 717

Bennett, C. (1976). Efficient estimation of free energy differences from Monte Carlo data. Journal of Computational Physics, 22(2), 245–268. 632

Bennett, J. and Lanning, S. (2007). The Netflix prize. 482

Berger, A. L., Della Pietra, V. J., and Della Pietra, S. A. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22, 39–71. 476

Berglund, M. and Raiko, T. (2013). Stochastic gradient estimate variance in contrastive divergence and persistent contrastive divergence. CoRR, abs/1312.6002. 617

Bergstra, J. (2011). Incorporating Complex Cells into Neural Networks for Pattern Classification. Ph.D. thesis, Université de Montréal. 254

Bergstra, J. and Bengio, Y. (2009). Slow, decorrelated features for pretraining complex cell-like networks. In NIPS'2009. 497

Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Machine Learning Res., 13, 281–305. 437, 438

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proc. SciPy. 25, 82, 212, 222, 449
Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In NIPS'2011. 439

Berkes, P. and Wiskott, L. (2005). Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 5(6), 579–602. 498

Bertsekas, D. P. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific. 106

Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179–195. 618

Bishop, C. M. (1994). Mixture density networks. 188

Bishop, C. M. (1995a). Regularization and complexity control in feed-forward networks. In Proceedings International Conference on Artificial Neural Networks ICANN'95, volume 1, pages 141–148. 242, 249

Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116. 242

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. 98, 145

Blum, A. L. and Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. 293

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the Vapnik–Chervonenkis dimension. Journal of the ACM, 36(4), 929–965. 114

Bonnet, G. (1964). Transformations des signaux aléatoires à travers les systèmes non linéaires sans mémoire. Annales des Télécommunications, 19(9–10), 203–220. 691

Bordes, A., Weston, J., Collobert, R., and Bengio, Y. (2011). Learning structured embeddings of knowledge bases. In AAAI 2011. 487

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2012). Joint learning of words and meaning representations for open-text semantic parsing. AISTATS'2012. 403, 487

Bordes, A., Glorot, X., Weston, J., and Bengio, Y. (2013a). A semantic matching energy function for learning with multi-relational data. Machine Learning: Special Issue on Learning Semantics. 486

Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013b). Translating embeddings for modeling multi-relational data. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2787–2795. Curran Associates, Inc. 487

Bornschein, J. and Bengio, Y. (2015). Reweighted wake-sleep. In ICLR'2015, arXiv:1406.2751. 695
BIBLIOGRAPHY
Bornschein, J., Shabanian, S., Fischer, A., and Bengio, Y. (2015). Training bidirectional Helmholtz machines. Technical report, arXiv:1506.03877. 695

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, New York, NY, USA. ACM. 18, 140

Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning in Neural Networks. Cambridge University Press, Cambridge, UK. 296

Bottou, L. (2011). From machine learning to machine reasoning. Technical report, arXiv:1102.1808. 401, 403

Bottou, L. (2015). Multilayer neural networks. Deep Learning Summer School. 443

Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In NIPS’2008. 282, 295

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML’12. 688

Boureau, Y., Ponce, J., and LeCun, Y. (2010). A theoretical analysis of feature pooling in vision algorithms. In Proc. International Conference on Machine Learning (ICML’10). 346

Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. (2011). Ask the locals: multi-way local pooling for image recognition. In Proc. International Conference on Computer Vision (ICCV’11). IEEE. 346

Bourlard, H. and Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59, 291–294. 505

Bourlard, H. and Wellekens, C. (1989). Speech pattern discrimination and multi-layered perceptrons. Computer Speech and Language, 3, 1–19. 462

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press, New York, NY, USA. 93

Brady, M. L., Raghavan, R., and Slawny, J. (1989). Back-propagation fails to separate where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36, 665–674. 284

Brakel, P., Stroobandt, D., and Schrauwen, B. (2013). Training energy-based models for time-series imputation. Journal of Machine Learning Research, 14, 2771–2797. 676, 700

Brand, M. (2003). Charting a manifold. In NIPS’2002, pages 961–968. MIT Press. 163, 521
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140. 255

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, CA. 145

Bridle, J. S. (1990). Alphanets: a recurrent ‘neural’ network architecture with a hidden Markov model interpretation. Speech Communication, 9(1), 83–92. 185

Briggman, K., Denk, W., Seung, S., Helmstaedter, M. N., and Turaga, S. C. (2009). Maximin affinity learning of image segmentation. In NIPS’2009, pages 1865–1873. 360

Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., and Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79–85. 21

Brown, P. F., Pietra, V. J. D., DeSouza, P. V., Lai, J. C., and Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18, 467–479. 466

Bryson, A. and Ho, Y. (1969). Applied optimal control: optimization, estimation, and control. Blaisdell Pub. Co. 225

Bryson, A. E., Jr. and Denham, W. F. (1961). A steepest-ascent method for solving optimum programming problems. Technical Report BR-1303, Raytheon Company, Missile and Space Division. 225

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM. 451

Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519. 700

Cai, M., Shi, Y., and Liu, J. (2013). Deep maxout neural networks for speech recognition. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 291–296. IEEE. 193

Carreira-Perpiñan, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS’05), pages 33–40. Society for Artificial Intelligence and Statistics. 614

Caruana, R. (1993). Multitask connectionist learning. In Proc. 1993 Connectionist Models Summer School, pages 372–379. 245

Cauchy, A. (1847). Méthode générale pour la résolution de systèmes d’équations simultanées. In Compte rendu des séances de l’académie des sciences, pages 536–538. 83, 224
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923, UCSD. 163

Chandola, V., Banerjee, A., and Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 15. 102

Chapelle, O., Weston, J., and Schölkopf, B. (2003). Cluster kernels for semi-supervised learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15 (NIPS’02), pages 585–592, Cambridge, MA. MIT Press. 244

Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press, Cambridge, MA. 244, 544

Chellapilla, K., Puri, S., and Simard, P. (2006). High Performance Convolutional Neural Networks for Document Processing. In Guy Lorette, editor, Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France). Université de Rennes 1, Suvisoft. http://www.suvisoft.com. 24, 27, 448

Chen, B., Ting, J.-A., Marlin, B. M., and de Freitas, N. (2010). Deep learning of invariant spatio-temporal features from video. NIPS*2010 Deep Learning and Unsupervised Feature Learning Workshop. 361

Chen, S. F. and Goodman, J. T. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4), 359–393. 465, 476

Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., and Temam, O. (2014a). DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems, pages 269–284. ACM. 454

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. 25

Chen, Y., Luo, T., Liu, S., Zhang, S., He, L., Wang, J., Li, L., Chen, T., Xu, Z., Sun, N., et al. (2014b). DaDianNao: A machine-learning supercomputer. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 609–622. IEEE. 454

Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. (2014). Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). 450

Cho, K., Raiko, T., and Ilin, A. (2010). Parallel tempering is efficient for learning restricted Boltzmann machines. In IJCNN’2010. 606, 617
Cho, K., Raiko, T., and Ilin, A. (2011). Enhanced gradient and adaptive learning rate for training restricted Boltzmann machines. In ICML’2011, pages 105–112. 676

Cho, K., van Merriënboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014a). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014). 397, 477, 478

Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014b). On the properties of neural machine translation: Encoder-decoder approaches. ArXiv e-prints, abs/1409.1259. 414

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The loss surface of multilayer networks. 285, 286

Chorowski, J., Bahdanau, D., Cho, K., and Bengio, Y. (2014). End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv:1412.1602. 463

Christianson, B. (1992). Automatic Hessians by reverse accumulation. IMA Journal of Numerical Analysis, 12(2), 135–150. 224

Chrupala, G., Kadar, A., and Alishahi, A. (2015). Learning language through pictures. arXiv 1506.03694. 414

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS’2014 Deep Learning workshop, arXiv 1412.3555. 414, 463

Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2015a). Gated feedback recurrent neural networks. In ICML’15. 414

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., and Bengio, Y. (2015b). A recurrent latent variable model for sequential data. In NIPS’2015. 700

Ciresan, D., Meier, U., Masci, J., and Schmidhuber, J. (2012). Multi-column deep neural network for traffic sign classification. Neural Networks, 32, 333–338. 23, 200

Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22, 1–14. 24, 27, 449

Coates, A. and Ng, A. Y. (2011). The importance of encoding versus training with sparse coding and vector quantization. In ICML’2011. 27, 254, 501

Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011). 364, 365, 458
Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. (2013). Deep learning with COTS HPC systems. In S. Dasgupta and D. McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28 (3), pages 1337–1345. JMLR Workshop and Conference Proceedings. 24, 27, 365, 450

Cohen, N., Sharir, O., and Shashua, A. (2015). On the expressive power of deep learning: A tensor analysis. arXiv:1509.05009. 557

Collobert, R. (2004). Large Scale Machine Learning. Ph.D. thesis, Université de Paris VI, LIP6. 196

Collobert, R. (2011). Deep learning for efficient discriminative parsing. In AISTATS’2011. 101, 480

Collobert, R. and Weston, J. (2008a). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML’2008. 474, 480

Collobert, R. and Weston, J. (2008b). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML’2008. 538

Collobert, R., Bengio, S., and Bengio, Y. (2001). A parallel mixture of SVMs for very large scale problems. Technical Report IDIAP-RR-01-12, IDIAP. 453

Collobert, R., Bengio, S., and Bengio, Y. (2002). Parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5), 1105–1114. 453

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011a). Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12, 2493–2537. 329, 480, 538, 539

Collobert, R., Kavukcuoglu, K., and Farabet, C. (2011b). Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop. 25, 210, 449

Comon, P. (1994). Independent component analysis – a new concept? Signal Processing, 36, 287–314. 494

Cortes, C. and Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297. 18, 140

Couprie, C., Farabet, C., Najman, L., and LeCun, Y. (2013). Indoor semantic segmentation using depth information. In International Conference on Learning Representations (ICLR2013). 23, 200

Courbariaux, M., Bengio, Y., and David, J.-P. (2015). Low precision arithmetic for deep learning. In Arxiv:1412.7024, ICLR’2015 Workshop. 455

Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-slab RBMs. In ICML’11. 564, 683
Courville, A., Desjardins, G., Bergstra, J., and Bengio, Y. (2014). The spike-and-slab RBM and extensions to discrete and sparse data distributions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(9), 1874–1887. 685

Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd Edition. Wiley-Interscience. 73

Cox, D. and Pinto, N. (2011). Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, pages 8–15. IEEE. 364

Cramér, H. (1946). Mathematical methods of statistics. Princeton University Press. 135, 295

Crick, F. H. C. and Mitchison, G. (1983). The function of dream sleep. Nature, 304, 111–114. 612

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314. 197

Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In NIPS’2010. 23

Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 33–42. 462

Dahl, G. E., Sainath, T. N., and Hinton, G. E. (2013). Improving deep neural networks for LVCSR using rectified linear units and dropout. In ICASSP’2013. 462

Dahl, G. E., Jaitly, N., and Salakhutdinov, R. (2014). Multi-task neural networks for QSAR predictions. arXiv:1406.1231. 26

Dauphin, Y. and Bengio, Y. (2013). Stochastic ratio matching of RBMs for sparse high-dimensional inputs. In NIPS26. NIPS Foundation. 622

Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. In ICML’2011. 474

Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In NIPS’2014. 285, 286, 288

Davis, A., Rubinstein, M., Wadhwa, N., Mysore, G., Durand, F., and Freeman, W. T. (2014). The visual microphone: Passive recovery of sound from video. ACM Transactions on Graphics (Proc. SIGGRAPH), 33(4), 79:1–79:10. 455
BIBLIOGRAPHY
Dayan, P. (1990). Reinforcement comparison. In Connectionist Models: Proceedings of the 1990 Connectionist Summer School, San Mateo, CA. 693

Dayan, P. and Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403. 695

Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904. 695

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. (2012). Large scale distributed deep networks. In NIPS’2012. 25, 450

Dean, T. and Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142–150. 664

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. 479, 485

Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS. 19, 557

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09. 21

Deng, J., Berg, A. C., Li, K., and Fei-Fei, L. (2010a). What does classifying more than 10,000 image categories tell us? In Proceedings of the 11th European Conference on Computer Vision: Part V, ECCV’10, pages 71–84, Berlin, Heidelberg. Springer-Verlag. 21

Deng, L. and Yu, D. (2014). Deep learning – methods and applications. Foundations and Trends in Signal Processing. 463

Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G. (2010b). Binary coding of speech spectrograms using a deep auto-encoder. In Interspeech 2010, Makuhari, Chiba, Japan. 23

Denil, M., Bazzani, L., Larochelle, H., and de Freitas, N. (2012). Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8), 2151–2184. 368

Denton, E., Chintala, S., Szlam, A., and Fergus, R. (2015). Deep generative image models using a Laplacian pyramid of adversarial networks. NIPS. 703, 704, 720

Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision. Technical Report 1327, Département d’Informatique et de Recherche Opérationnelle, Université de Montréal. 685
Desjardins, G., Courville, A. C., Bengio, Y., Vincent, P., and Delalleau, O. (2010). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, pages 145–152. 606, 617

Desjardins, G., Courville, A., and Bengio, Y. (2011). On tracking the partition function. In NIPS’2011. 633

Desjardins, G., Simonyan, K., Pascanu, R., et al. (2015). Natural neural networks. In Advances in Neural Information Processing Systems, pages 2062–2070. 321

Devlin, J., Zbib, R., Huang, Z., Lamar, T., Schwartz, R., and Makhoul, J. (2014). Fast and robust neural network joint models for statistical machine translation. In Proc. ACL’2014. 476

Devroye, L. (2013). Non-Uniform Random Variate Generation. SpringerLink: Bücher. Springer New York. 696

DiCarlo, J. J. (2013). Mechanisms underlying visual object recognition: Humans vs. neurons vs. machines. NIPS Tutorial. 26, 367

Dinh, L., Krueger, D., and Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv:1410.8516. 496

Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2014). Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389. 102

Donoho, D. L. and Grimes, C. (2003). Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Technical Report 2003-08, Dept. Statistics, Stanford University. 163, 522

Dosovitskiy, A., Springenberg, J. T., and Brox, T. (2015). Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538–1546. 697, 706, 707

Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks, 1, 75–80. 403, 406

Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1), 30–45. 225

Dreyfus, S. E. (1973). The computational solution of optimal control problems with time lag. IEEE Transactions on Automatic Control, 18(4), 383–385. 225

Drucker, H. and LeCun, Y. (1992). Improving generalisation performance using double back-propagation. IEEE Transactions on Neural Networks, 3(6), 991–997. 270
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research. 307

Dudik, M., Langford, J., and Li, L. (2011). Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, ICML ’11. 485

Dugas, C., Bengio, Y., Bélisle, F., and Nadeau, C. (2001). Incorporating second-order functional knowledge for better option pricing. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (NIPS’00), pages 472–478. MIT Press. 68, 196

Dziugaite, G. K., Roy, D. M., and Ghahramani, Z. (2015). Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906. 705

El Hihi, S. and Bengio, Y. (1996). Hierarchical recurrent neural networks for long-term dependencies. In NIPS’1995. 401, 410

Elkahky, A. M., Song, Y., and He, X. (2015). A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web, pages 278–288. 483

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 781–799. 329

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., and Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. In Proceedings of AISTATS’2009. 200

Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. (2010). Why does unsupervised pre-training help deep learning? J. Machine Learning Res. 532, 536, 537

Fahlman, S. E., Hinton, G. E., and Sejnowski, T. J. (1983). Massively parallel architectures for AI: NETL, thistle, and Boltzmann machines. In Proceedings of the National Conference on Artificial Intelligence AAAI-83. 573, 656

Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J. C., Zitnick, C. L., and Zweig, G. (2015). From captions to visual concepts and back. arXiv:1411.4952. 102

Farabet, C., LeCun, Y., Kavukcuoglu, K., Culurciello, E., Martini, B., Akselrod, P., and Talay, S. (2011). Large-scale FPGA-based convolutional networks. In R. Bekkerman, M. Bilenko, and J. Langford, editors, Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press. 526
Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2013). Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1915–1929. 23, 200, 360

Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611. 541

Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., and Abbeel, P. (2015). Learning visual feature spaces for robotic manipulation with deep spatial autoencoders. arXiv preprint arXiv:1509.06113. 25

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188. 21, 105

Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In International Joint Conference on Neural Networks (IJCNN), volume 1, pages 401–405, Washington 1989. IEEE, New York. 497

Franzius, M., Sprekeler, H., and Wiskott, L. (2007). Slowness and sparseness lead to place, head-direction, and spatial-view cells. 498

Franzius, M., Wilbert, N., and Wiskott, L. (2008). Invariant object recognition with slow feature analysis. In Artificial Neural Networks-ICANN 2008, pages 961–970. Springer. 499

Frasconi, P., Gori, M., and Sperduti, A. (1997). On the efficient classification of data structures by neural networks. In Proc. Int. Joint Conf. on Artificial Intelligence. 401, 403

Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786. 401, 403

Freund, Y. and Schapire, R. E. (1996a). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of Thirteenth International Conference, pages 148–156, USA. ACM. 257

Freund, Y. and Schapire, R. E. (1996b). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325–332. 257

Frey, B. J. (1998). Graphical models for machine learning and digital communication. MIT Press. 707, 708

Frey, B. J., Hinton, G. E., and Dayan, P. (1996). Does the wake-sleep algorithm learn good density estimators? In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS’95), pages 661–670. MIT Press, Cambridge, MA. 654
Frobenius, G. (1908). Über matrizen aus positiven elementen. S. B. Preuss. Akad. Wiss., Berlin, Germany. 600

Fukushima, K. (1975). Cognitron: A self-organizing multilayered neural network. Biological Cybernetics, 20, 121–136. 16, 226, 531

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202. 16, 24, 27, 226, 368

Gal, Y. and Ghahramani, Z. (2015). Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158. 263

Gallinari, P., LeCun, Y., Thiria, S., and Fogelman-Soulie, F. (1987). Memoires associatives distribuees. In Proceedings of COGNITIVA 87, Paris, La Villette. 518

Garcia-Duran, A., Bordes, A., Usunier, N., and Grandvalet, Y. (2015). Combining two and three-way embeddings models for link prediction in knowledge bases. arXiv preprint arXiv:1506.00999. 487

Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 27403. 462

Garson, J. (1900). The metric system of identification of criminals, as used in Great Britain and Ireland. The Journal of the Anthropological Institute of Great Britain and Ireland, (2), 177–227. 21

Gers, F. A., Schmidhuber, J., and Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471. 411, 415

Ghahramani, Z. and Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Dpt. of Comp. Sci., Univ. of Toronto. 492

Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. (2015). Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103. 480

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2015). Region-based convolutional networks for accurate object detection and segmentation. 429

Giudice, M. D., Manera, V., and Keysers, C. (2009). Programmed to learn? The ontogeny of mirror neurons. Dev. Sci., 12(2), 350–363. 658

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS’2010. 303

Glorot, X., Bordes, A., and Bengio, Y. (2011a). Deep sparse rectifier neural networks. In AISTATS’2011. 16, 173, 196, 226
Glorot, X., Bordes, A., and Bengio, Y. (2011b). Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML’2011. 510, 540

Goldberger, J., Roweis, S., Hinton, G. E., and Salakhutdinov, R. (2005). Neighbourhood components analysis. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS’04). MIT Press. 115

Gong, S., McKenna, S., and Psarrou, A. (2000). Dynamic Vision: From Images to Face Recognition. Imperial College Press. 164, 522

Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS’2009, pages 646–654. 254

Goodfellow, I., Koenig, N., Muja, M., Pantofaru, C., Sorokin, A., and Takayama, L. (2010). Help me help you: Interfaces for personal robots. In Proc. of Human Robot Interaction (HRI), Osaka, Japan. ACM Press. 100

Goodfellow, I. J. (2010). Technical report: Multidimensional, downsampled convolution for autoencoders. Technical report, Université de Montréal. 358

Goodfellow, I. J. (2014). On distinguishability criteria for estimating generative models. In International Conference on Learning Representations, Workshops Track. 625, 702, 703

Goodfellow, I. J., Courville, A., and Bengio, Y. (2011). Spike-and-slab sparse coding for unsupervised feature discovery. In NIPS Workshop on Challenges in Learning Hierarchical Models. 535, 541

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013a). Maxout networks. In S. Dasgupta and D. McAllester, editors, ICML’13, pages 1319–1327. 192, 263, 345, 366, 458

Goodfellow, I. J., Mirza, M., Courville, A., and Bengio, Y. (2013b). Multi-prediction deep Boltzmann machines. In NIPS26. NIPS Foundation. 100, 620, 673, 674, 675, 676, 677, 700

Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214. 25, 449

Goodfellow, I. J., Courville, A., and Bengio, Y. (2013d). Scaling up spike-and-slab models for unsupervised feature learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1902–1914. 500, 501, 502, 652, 685

Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., and Bengio, Y. (2014a). An empirical investigation of catastrophic forgetting in gradient-based neural networks. In ICLR’2014. 193
BIBLIOGRAPHY
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014b). Explaining and harnessing adversarial examples. CoRR , abs/1412.6572. 267, 268, 270, 558, 559

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014c). Generative adversarial networks. In NIPS’2014 . 547, 691, 702, 703, 706

Goodfellow, I. J., Bulatov, Y., Ibarz, J., Arnoud, S., and Shet, V. (2014d). Multi-digit number recognition from Street View imagery using deep convolutional neural networks. In International Conference on Learning Representations . 25, 101, 200, 201, 202, 391, 425, 452

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. In International Conference on Learning Representations . 285, 286, 287, 291

Goodman, J. (2001). Classes for fast maximum entropy training. In International Conference on Acoustics, Speech and Signal Processing (ICASSP) , Utah. 470

Gori, M. and Tesi, A. (1992). On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence , PAMI-14(1), 76–86. 284

Gosset, W. S. (1908). The probable error of a mean. Biometrika , 6(1), 1–25. Originally published under the pseudonym “Student”. 21

Gouws, S., Bengio, Y., and Corrado, G. (2014). BilBOWA: Fast bilingual distributed representations without word alignments. Technical report, arXiv:1410.2455. 479, 542

Graf, H. P. and Jackel, L. D. (1989). Analog electronic neural network circuits. Circuits and Devices Magazine, IEEE , 5(4), 44–49. 454

Graves, A. (2011). Practical variational inference for neural networks. In NIPS’2011 . 242

Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks . Studies in Computational Intelligence. Springer. 375, 396, 414, 463

Graves, A. (2013). Generating sequences with recurrent neural networks. Technical report, arXiv:1308.0850. 189, 411, 418, 422

Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. In ICML’2014 . 411

Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks , 18(5), 602–610. 396

Graves, A. and Schmidhuber, J. (2009). Offline handwriting recognition with multidimensional recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, NIPS’2008 , pages 545–552. 396
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML’2006 , pages 369–376, Pittsburgh, USA. 463

Graves, A., Liwicki, M., Bunke, H., Schmidhuber, J., and Fernández, S. (2008). Unconstrained on-line handwriting recognition with recurrent neural networks. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS’2007 , pages 577–584. 396

Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on , 31(5), 855–868. 411

Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In ICASSP’2013 , pages 6645–6649. 396, 399, 401, 411, 413, 414, 463

Graves, A., Wayne, G., and Danihelka, I. (2014a). Neural Turing machines. arXiv:1410.5401. 25

Graves, A., Wayne, G., and Danihelka, I. (2014b). Neural Turing machines. arXiv preprint arXiv:1410.5401 . 419, 421

Grefenstette, E., Hermann, K. M., Suleyman, M., and Blunsom, P. (2015). Learning to transduce with unbounded memory. In NIPS’2015 . 421

Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmidhuber, J. (2015). LSTM: a search space odyssey. arXiv preprint arXiv:1503.04069 . 415

Gregor, K. and LeCun, Y. (2010a). Emergence of complex-like cells in a temporal product network with local receptive fields. Technical report, arXiv:1006.0448. 353

Gregor, K. and LeCun, Y. (2010b). Learning fast approximations of sparse coding. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10) . ACM. 655

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. (2014). Deep autoregressive networks. In International Conference on Machine Learning (ICML’2014) . 695

Gregor, K., Danihelka, I., Graves, A., and Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623 . 700

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research , 13(1), 723–773. 705

Gülçehre, Ç. and Bengio, Y. (2013). Knowledge matters: Importance of prior information for optimization. In International Conference on Learning Representations (ICLR’2013) . 25
Guo, H. and Gelfand, S. B. (1992). Classification trees with neural network feature extraction. Neural Networks, IEEE Transactions on , 3(6), 923–933. 453

Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. (2015). Deep learning with limited numerical precision. CoRR , abs/1502.02551. 455

Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS’10) . 623

Hadsell, R., Sermanet, P., Ben, J., Erkan, A., Han, J., Muller, U., and LeCun, Y. (2007). Online learning for offroad robots: Spatial label propagation to learn long-range traversability. In Proceedings of Robotics: Science and Systems , Atlanta, GA, USA. 456

Hajnal, A., Maass, W., Pudlak, P., Szegedy, M., and Turan, G. (1993). Threshold circuits of bounded depth. J. Comput. System. Sci. , 46, 129–154. 198

Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual ACM Symposium on Theory of Computing , pages 6–20, Berkeley, California. ACM Press. 198

Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity , 1, 113–129. 198

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning: data mining, inference and prediction . Springer Series in Statistics. Springer Verlag. 145

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852 . 28, 192

Hebb, D. O. (1949). The Organization of Behavior . Wiley, New York. 14, 17, 658

Henaff, M., Jarrett, K., Kavukcuoglu, K., and LeCun, Y. (2011). Unsupervised learning of sparse features for scalable audio classification. In ISMIR’11 . 526

Henderson, J. (2003). Inducing history representations for broad coverage statistical parsing. In HLT-NAACL , pages 103–110. 480

Henderson, J. (2004). Discriminative training of a neural network statistical parser. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics , page 95. 480

Henniges, M., Puertas, G., Bornschein, J., Eggert, J., and Lücke, J. (2010). Binary sparse coding. In Latent Variable Analysis and Signal Separation , pages 450–457. Springer. 643
Herault, J. and Ans, B. (1984). Circuits neuronaux à synapses modifiables: Décodage de messages composites par apprentissage non supervisé. Comptes Rendus de l’Académie des Sciences , 299(III-13), 525–528. 494

Hinton, G. (2012). Neural networks for machine learning. Coursera, video lectures. 307

Hinton, G., Deng, L., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012a). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine , 29(6), 82–97. 23, 463

Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 . 451

Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence , 40, 185–234. 497

Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence , 46(1), 47–75. 421

Hinton, G. E. (1999). Products of experts. In ICANN’1999 . 573

Hinton, G. E. (2000). Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. 613, 678

Hinton, G. E. (2006). To recognize shapes, first learn to generate images. Technical Report UTML TR 2006-003, University of Toronto. 531, 598

Hinton, G. E. (2007a). How to do backpropagation in a brain. Invited talk at the NIPS’2007 Deep Learning Workshop. 658

Hinton, G. E. (2007b). Learning multiple layers of representation. Trends in cognitive sciences , 11(10), 428–434. 662

Hinton, G. E. (2010). A practical guide to training restricted Boltzmann machines. Technical Report UTML TR 2010-003, Department of Computer Science, University of Toronto. 613

Hinton, G. E. and Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London . 146

Hinton, G. E. and McClelland, J. L. (1988). Learning representations by recirculation. In NIPS’1987 , pages 358–366. 505

Hinton, G. E. and Roweis, S. (2003). Stochastic neighbor embedding. In NIPS’2002 . 522
Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science , 313(5786), 504–507. 512, 527, 531, 532, 537

Hinton, G. E. and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing , volume 1, chapter 7, pages 282–317. MIT Press, Cambridge. 573, 656

Hinton, G. E. and Sejnowski, T. J. (1999). Unsupervised learning: foundations of neural computation . MIT Press. 544

Hinton, G. E. and Shallice, T. (1991). Lesioning an attractor network: investigations of acquired dyslexia. Psychological review , 98(1), 74. 13

Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In NIPS’1993 . 505

Hinton, G. E., Sejnowski, T. J., and Ackley, D. H. (1984). Boltzmann machines: Constraint satisfaction networks that learn. Technical Report TR-CMU-CS-84-119, Carnegie-Mellon University, Dept. of Computer Science. 573, 656

Hinton, G. E., McClelland, J., and Rumelhart, D. (1986). Distributed representations. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition , volume 1, pages 77–109. MIT Press, Cambridge. 17, 225, 529

Hinton, G. E., Revow, M., and Dayan, P. (1995a). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7 (NIPS’94) , pages 1015–1022. MIT Press, Cambridge, MA. 492

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995b). The wake-sleep algorithm for unsupervised neural networks. Science , 268, 1158–1161. 507, 654

Hinton, G. E., Dayan, P., and Revow, M. (1997). Modelling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks , 8, 65–74. 502

Hinton, G. E., Welling, M., Teh, Y. W., and Osindero, S. (2001). A new view of ICA. In Proceedings of 3rd International Conference on Independent Component Analysis and Blind Signal Separation (ICA’01) , pages 746–751, San Diego, CA. 494

Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation , 18, 1527–1554. 14, 19, 27, 142, 531, 532, 662, 663

Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012b). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. , 29(6), 82–97. 101
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012c). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580. 239, 261, 266

Hinton, G. E., Vinyals, O., and Dean, J. (2014). Dark knowledge. Invited talk at the BayLearn Bay Area Machine Learning Symposium. 451

Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, T.U. Münich. 18, 403, 405

Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems 7 , pages 529–536. MIT Press. 243

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation , 9(8), 1735–1780. 18, 411, 414

Hochreiter, S., Informatik, F. F., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2000). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer, editors, Field Guide to Dynamical Recurrent Networks . IEEE Press. 414

Holi, J. L. and Hwang, J.-N. (1993). Finite precision error analysis of neural network hardware implementations. Computers, IEEE Transactions on , 42(3), 281–290. 454

Holt, J. L. and Baker, T. E. (1991). Back propagation simulations using limited precision calculations. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on , volume 2, pages 121–126. IEEE. 454

Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks , 2, 359–366. 197

Hornik, K., Stinchcombe, M., and White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks , 3(5), 551–560. 197

Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World Chess Champion . Princeton University Press, Princeton, NJ, USA. 2

Huang, F. and Ogata, Y. (2002). Generalized pseudo-likelihood estimates for Markov random fields on lattice. Annals of the Institute of Statistical Mathematics , 54(1), 1–18. 619

Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management , pages 2333–2338. ACM. 483
BIBLIOGRAPHY
Hubel, D. and Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (London), 195, 215–243. 365

Hubel, D. H. and Wiesel, T. N. (1959). Receptive fields of single neurons in the cat's striate cortex. Journal of Physiology, 148, 574–591. 365

Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. Journal of Physiology (London), 160, 106–154. 365

Huszar, F. (2015). How (not) to train your generative model: schedule sampling, likelihood, or adversary? arXiv:1511.05101. 699

Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. In LION-5. Extended version as UBC Tech report TR-2010-10. 439

Hyotyniemi, H. (1996). Turing machines are recurrent neural networks. In STeP'96, pages 13–24. 380

Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2, 94–128. 494

Hyvärinen, A. (2005). Estimation of non-normalized statistical models using score matching. Journal of Machine Learning Research, 6, 695–709. 516, 620

Hyvärinen, A. (2007a). Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18, 1529–1531. 621

Hyvärinen, A. (2007b). Some extensions of score matching. Computational Statistics and Data Analysis, 51, 2499–2512. 621

Hyvärinen, A. and Hoyer, P. O. (1999). Emergence of topography and complex cell properties from natural images using extensions of ICA. In NIPS, pages 827–833. 496

Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429–439. 496

Hyvärinen, A., Karhunen, J., and Oja, E. (2001a). Independent Component Analysis. Wiley-Interscience. 494

Hyvärinen, A., Hoyer, P. O., and Inki, M. O. (2001b). Topographic independent component analysis. Neural Computation, 13(7), 1527–1558. 496

Hyvärinen, A., Hurri, J., and Hoyer, P. O. (2009). Natural Image Statistics: A probabilistic approach to early computational vision. Springer-Verlag. 371
Iba, Y. (2001). Extended ensemble Monte Carlo. International Journal of Modern Physics, C12, 623–656. 606

Inayoshi, H. and Kurita, T. (2005). Improved generalization by adding both auto-association and hidden-layer noise to neural-network-based-classifiers. IEEE Workshop on Machine Learning for Signal Processing, pages 141–146. 518

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. 100, 318, 321

Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural networks, 1(4), 295–307. 307

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87. 188, 453

Jaeger, H. (2003). Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems 15. 406

Jaeger, H. (2007a). Discovering multiscale dynamical features with hierarchical echo state networks. Technical report, Jacobs University. 401

Jaeger, H. (2007b). Echo state network. Scholarpedia, 2(9), 2330. 406

Jaeger, H. (2012). Long short-term memory in echo state networks: Details of a simulation study. Technical report, Jacobs University Bremen. 407

Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80. 27, 406

Jaeger, H., Lukosevicius, M., Popovici, D., and Siewert, U. (2007). Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3), 335–352. 410

Jain, V., Murray, J. F., Roth, F., Turaga, S., Zhigulin, V., Briggman, K. L., Helmstaedter, M. N., Denk, W., and Seung, H. S. (2007). Supervised learning of image restoration with convolutional networks. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE. 360

Jaitly, N. and Hinton, G. (2011). Learning a better representation of speech soundwaves using restricted Boltzmann machines. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5884–5887. IEEE. 461

Jaitly, N. and Hinton, G. E. (2013). Vocal tract length perturbation (VTLP) improves speech recognition. In ICML'2013. 241

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In ICCV'09. 16, 24, 27, 173, 192, 226, 364, 365, 526
Jarzynski, C. (1997). Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78, 2690–2693. 628, 631

Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. 53

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2014). On using very large target vocabulary for neural machine translation. arXiv:1412.2007. 477

Jelinek, F. and Mercer, R. L. (1980). Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. North-Holland, Amsterdam. 465, 476

Jia, Y. (2013). Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/. 25, 210

Jia, Y., Huang, C., and Darrell, T. (2012). Beyond spatial pyramids: Receptive field learning for pooled image features. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3370–3377. IEEE. 346

Jim, K.-C., Giles, C. L., and Horne, B. G. (1996). An analysis of noise in recurrent neural networks: convergence and generalization. IEEE Transactions on Neural Networks, 7(6), 1424–1438. 242

Jordan, M. I. (1998). Learning in Graphical Models. Kluwer, Dordrecht, Netherlands. 18

Joulin, A. and Mikolov, T. (2015). Inferring algorithmic patterns with stack-augmented recurrent nets. arXiv preprint arXiv:1503.01007. 421

Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An empirical evaluation of recurrent network architectures. In ICML'2015. 306, 414, 415

Judd, J. S. (1989). Neural Network Design and the Complexity of Learning. MIT Press. 293

Jutten, C. and Herault, J. (1991). Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10. 494

Kahou, S. E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, Ç., Memisevic, R., Vincent, P., Courville, A., Bengio, Y., Ferrari, R. C., Mirza, M., Jean, S., Carrier, P. L., Dauphin, Y., Boulanger-Lewandowski, N., Aggarwal, A., Zumer, J., Lamblin, P., Raymond, J.-P., Desjardins, G., Pascanu, R., Warde-Farley, D., Torabi, A., Sharma, A., Bengio, E., Côté, M., Konda, K. R., and Wu, Z. (2013). Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction. 200

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent continuous translation models. In EMNLP'2013. 477
Kalchbrenner, N., Danihelka, I., and Graves, A. (2015). Grid long short-term memory. arXiv preprint arXiv:1507.01526. 397

Kamyshanska, H. and Memisevic, R. (2015). The potential energy of an autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence. 518

Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image descriptions. In CVPR'2015. arXiv:1412.2306. 102

Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., and Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In CVPR. 21

Karush, W. (1939). Minima of Functions of Several Variables with Inequalities as Side Constraints. Master's thesis, Dept. of Mathematics, Univ. of Chicago. 95

Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35(3), 400–401. 465, 476

Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2008). Fast inference in sparse coding algorithms with applications to object recognition. Technical report, Computational and Biological Learning Lab, Courant Institute, NYU. Tech Report CBLL-TR-2008-12-01. 526

Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In CVPR'2009. 526

Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS'2010. 365, 526

Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10), 947–954. 225

Khan, F., Zhu, X., and Mutlu, B. (2011). How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems 24 (NIPS'11), pages 1449–1457. 329

Kim, S. K., McAfee, L. C., McMahon, P. L., and Olukotun, K. (2009). A highly scalable restricted Boltzmann machine FPGA implementation. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on, pages 367–372. IEEE. 454

Kindermann, R. (1980). Markov Random Fields and Their Applications (Contemporary Mathematics; V. 1). American Mathematical Society. 569

Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 308
Kingma, D. and LeCun, Y. (2010). Regularized estimation of image statistics by score matching. In NIPS'2010. 516, 623

Kingma, D., Rezende, D., Mohamed, S., and Welling, M. (2014). Semi-supervised learning with deep generative models. In NIPS'2014. 429

Kingma, D. P. (2013). Fast gradient-based inference with continuous latent variable models in auxiliary form. Technical report, arxiv:1306.0733. 655, 691, 698

Kingma, D. P. and Welling, M. (2014a). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR). 691, 701

Kingma, D. P. and Welling, M. (2014b). Efficient gradient-based inference through transformations between bayes nets and neural nets. Technical report, arxiv:1402.0480. 691

Kirkpatrick, S., Jr., C. D. G., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680. 328

Kiros, R., Salakhutdinov, R., and Zemel, R. (2014a). Multimodal neural language models. In ICML'2014. 102

Kiros, R., Salakhutdinov, R., and Zemel, R. (2014b). Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539 [cs.LG]. 102, 411

Klementiev, A., Titov, I., and Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012. 479, 542

Knowles-Barley, S., Jones, T. R., Morgan, J., Lee, D., Kasthuri, N., Lichtman, J. W., and Pfister, H. (2014). Deep learning for the connectome. GPU Technology Conference. 26

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press. 585, 598, 648

Konig, Y., Bourlard, H., and Morgan, N. (1996). REMAP: Recursive estimation and maximization of a posteriori probabilities – application to transition-based connectionist speech recognition. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS'95). MIT Press, Cambridge, MA. 462

Koren, Y. (2009). The BellKor solution to the Netflix grand prize. 256, 482

Kotzias, D., Denil, M., de Freitas, N., and Smyth, P. (2015). From group to individual labels using deep features. In ACM SIGKDD. 106

Koutnik, J., Greff, K., Gomez, F., and Schmidhuber, J. (2014). A clockwork RNN. In ICML'2014. 410

Kočiský, T., Hermann, K. M., and Blunsom, P. (2014). Learning Bilingual Word Representations by Marginalizing Alignments. In Proceedings of ACL. 479
Krause, O., Fischer, A., Glasmachers, T., and Igel, C. (2013). Approximation properties of DBNs with binary hidden units and real-valued visible units. In ICML'2013. 556

Krizhevsky, A. (2010). Convolutional deep belief networks on CIFAR-10. Technical report, University of Toronto. Unpublished Manuscript: http://www.cs.utoronto.ca/ kriz/conv-cifar10-aug2010.pdf. 449

Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto. 21, 564

Krizhevsky, A. and Hinton, G. E. (2011). Using very deep autoencoders for content-based image retrieval. In ESANN. 528

Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS'2012. 23, 24, 27, 100, 200, 372, 457, 461

Krueger, K. A. and Dayan, P. (2009). Flexible shaping: how learning in small steps helps. Cognition, 110, 380–394. 329

Kuhn, H. W. and Tucker, A. W. (1951). Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, pages 481–492, Berkeley, Calif. University of California Press. 95

Kumar, A., Irsoy, O., Su, J., Bradbury, J., English, R., Pierce, B., Ondruska, P., Iyyer, M., Gulrajani, I., and Socher, R. (2015). Ask me anything: Dynamic memory networks for natural language processing. arXiv:1506.07285. 421, 488

Kumar, M. P., Packer, B., and Koller, D. (2010). Self-paced learning for latent variable models. In NIPS'2010. 329

Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University. 368, 375, 409

Lang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural networks, 3(1), 23–43. 375

Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS'2008, pages 1096–1103. 483

Lappalainen, H., Giannakopoulos, X., Honkela, A., and Karhunen, J. (2000). Nonlinear independent component analysis using ensemble learning: Experiments and discussion. In Proc. ICA. Citeseer. 496

Larochelle, H. and Bengio, Y. (2008). Classification using discriminative restricted Boltzmann machines. In ICML'2008. 244, 254, 533, 688, 717
BIBLIOGRAPHY
Larochelle, H. and Hinton, G. E. (2010). Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in Neural Information Processing Systems 23, pages 1243–1251. 368

Larochelle, H. and Murray, I. (2011). The Neural Autoregressive Distribution Estimator. In AISTATS’2011. 707, 710

Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In AAAI Conference on Artificial Intelligence. 542

Larochelle, H., Bengio, Y., Louradour, J., and Lamblin, P. (2009). Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10, 1–40. 538

Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative and discriminative models. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’06), pages 87–94, Washington, DC, USA. IEEE Computer Society. 244, 252

Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled convolutional neural networks. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23 (NIPS’10), pages 1279–1287. 353

Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. (2011). On optimization methods for deep learning. In Proc. ICML’2011. ACM. 316

Le, Q., Ranzato, M., Monga, R., Devin, M., Corrado, G., Chen, K., Dean, J., and Ng, A. (2012). Building high-level features using large scale unsupervised learning. In ICML’2012. 24, 27

Le Roux, N. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6), 1631–1649. 556, 657

Le Roux, N. and Bengio, Y. (2010). Deep belief networks are compact universal approximators. Neural Computation, 22(8), 2192–2207. 556

LeCun, Y. (1985). Une procédure d’apprentissage pour Réseau à seuil assymétrique. In Cognitiva 85: A la Frontière de l’Intelligence Artificielle, des Sciences de la Connaissance et des Neurosciences, pages 599–604, Paris 1985. CESTA, Paris. 225

LeCun, Y. (1986). Learning processes in an asymmetric threshold network. In F. Fogelman-Soulié, E. Bienenstock, and G. Weisbuch, editors, Disordered Systems and Biological Organization, pages 233–240. Springer-Verlag, Les Houches, France. 351

LeCun, Y. (1987). Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université de Paris VI. 18, 505, 518

LeCun, Y. (1989). Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto. 331, 351
LeCun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D., Howard, R. E., and Hubbard, W. (1989). Handwritten digit recognition: Applications of neural network chips and automatic learning. IEEE Communications Magazine, 27(11), 41–46. 369

LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. (1998a). Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag. 310, 432

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998b). Gradient based learning applied to document recognition. Proc. IEEE. 16, 18, 21, 27, 372, 461, 463

LeCun, Y., Kavukcuoglu, K., and Farabet, C. (2010). Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, pages 253–256. IEEE. 372

L’Ecuyer, P. (1994). Efficiency improvement and variance reduction. In Proceedings of the 1994 Winter Simulation Conference, pages 122–132. 692

Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. (2014). Deeply-supervised nets. arXiv preprint arXiv:1409.5185. 327

Lee, H., Battle, A., Raina, R., and Ng, A. (2007). Efficient sparse coding algorithms. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS’06), pages 801–808. MIT Press. 640

Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07. 254

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML’09). ACM, Montreal, Canada. 364, 685, 686

Lee, Y. J. and Grauman, K. (2011). Learning the easy things first: self-paced visual category discovery. In CVPR’2011. 329

Leibniz, G. W. (1676). Memoir using the chain rule. (Cited in TMME 7:2&3 p 321–332, 2010). 224

Lenat, D. B. and Guha, R. V. (1989). Building large knowledge-based systems; representation and inference in the Cyc project. Addison-Wesley Longman Publishing Co., Inc. 2

Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867. 197, 198
Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics, II(2), 164–168. 312

L’Hôpital, G. F. A. (1696). Analyse des infiniment petits, pour l’intelligence des lignes courbes. Paris: L’Imprimerie Royale. 224

Li, Y., Swersky, K., and Zemel, R. S. (2015). Generative moment matching networks. CoRR, abs/1502.02761. 705

Lin, T., Horne, B. G., Tino, P., and Giles, C. L. (1996). Learning long-term dependencies is not as difficult with NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6), 1329–1338. 409

Lin, Y., Liu, Z., Sun, M., Liu, Y., and Zhu, X. (2015). Learning entity and relation embeddings for knowledge graph completion. In Proc. AAAI’15. 487

Linde, N. (1992). The machine that changed the world, episode 3. Documentary miniseries. 2

Lindsey, C. and Lindblad, T. (1994). Review of hardware neural networks: a user’s perspective. In Proc. Third Workshop on Neural Networks: From Biology to High Energy Physics, pages 195–202, Isola d’Elba, Italy. 454

Linnainmaa, S. (1976). Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2), 146–160. 225

LISA (2008). Deep learning tutorials: Restricted Boltzmann machines. Technical report, LISA Lab, Université de Montréal. 591

Long, P. M. and Servedio, R. A. (2010). Restricted Boltzmann machines are hard to approximately evaluate or simulate. In Proceedings of the 27th International Conference on Machine Learning (ICML’10). 660

Lotter, W., Kreiman, G., and Cox, D. (2015). Unsupervised learning of visual structure using predictive generative networks. arXiv preprint arXiv:1511.06380. 547, 548

Lovelace, A. (1842). Notes upon L. F. Menabrea’s “Sketch of the Analytical Engine invented by Charles Babbage”. 1

Lu, L., Zhang, X., Cho, K., and Renals, S. (2015). A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition. In Proc. Interspeech. 463

Lu, T., Pál, D., and Pál, M. (2010). Contextual multi-armed bandits. In International Conference on Artificial Intelligence and Statistics, pages 485–492. 483

Luenberger, D. G. (1984). Linear and Nonlinear Programming. Addison Wesley. 317

Lukoševičius, M. and Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149. 406
Luo, H., Shen, R., Niu, C., and Ullrich, C. (2011). Learning class-relevant features and class-irrelevant features via a hybrid third-order RBM. In International Conference on Artificial Intelligence and Statistics, pages 470–478. 689

Luo, H., Carrier, P. L., Courville, A., and Bengio, Y. (2013). Texture modeling with convolutional spike-and-slab RBMs and deep extensions. In AISTATS’2013. 102

Lyu, S. (2009). Interpretation and generalization of score matching. In Proceedings of the Twenty-fifth Conference in Uncertainty in Artificial Intelligence (UAI’09). 621

Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E., and Svetnik, V. (2015). Deep neural nets as a method for quantitative structure – activity relationships. J. Chemical information and modeling. 533

Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech, and Language Processing. 192

Maass, W. (1992). Bounds for the computational power and learning complexity of analog neural nets (extended abstract). In Proc. of the 25th ACM Symp. Theory of Computing, pages 335–344. 198

Maass, W., Schnitger, G., and Sontag, E. D. (1994). A comparison of the computational power of sigmoid and Boolean threshold circuits. Theoretical Advances in Neural Computation and Learning, pages 127–151. 198

Maass, W., Natschlaeger, T., and Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560. 406

MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press. 73

Maclaurin, D., Duvenaud, D., and Adams, R. P. (2015). Gradient-based hyperparameter optimization through reversible learning. arXiv preprint arXiv:1502.03492. 438

Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., and Yuille, A. L. (2015). Deep captioning with multimodal recurrent neural networks. In ICLR’2015. arXiv:1410.1090. 102

Marcotte, P. and Savard, G. (1992). Novel approaches to the discrimination problem. Zeitschrift für Operations Research (Theory), 36, 517–545. 276

Marlin, B. and de Freitas, N. (2011). Asymptotic efficiency of deterministic estimators for discrete energy-based models: Ratio matching and pseudolikelihood. In UAI’2011. 620, 622
Marlin, B., Swersky, K., Chen, B., and de Freitas, N. (2010). Inductive principles for restricted Boltzmann machine learning. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS’10), volume 9, pages 509–516. 616, 621, 622

Marquardt, D. W. (1963). An algorithm for least-squares estimation of non-linear parameters. Journal of the Society of Industrial and Applied Mathematics, 11(2), 431–441. 312

Marr, D. and Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194. 368

Martens, J. (2010). Deep learning via Hessian-free optimization. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML-10), pages 735–742. ACM. 304

Martens, J. and Medabalimi, V. (2014). On the expressive efficiency of sum product networks. arXiv:1411.7717. 557

Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML’2011. ACM. 415

Mase, S. (1995). Consistency of the maximum pseudo-likelihood estimator of continuous state space Gibbsian processes. The Annals of Applied Probability, 5(3), pp. 603–612. 619

McClelland, J., Rumelhart, D., and Hinton, G. (1995). The appeal of parallel distributed processing. In Computation & intelligence, pages 305–341. American Association for Artificial Intelligence. 17

McCulloch, W. S. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. 14, 15

Mead, C. and Ismail, M. (2012). Analog VLSI implementation of neural systems, volume 80. Springer Science & Business Media. 454

Melchior, J., Fischer, A., and Wiskott, L. (2013). How to center binary deep Boltzmann machines. arXiv preprint arXiv:1311.1354. 675

Memisevic, R. and Hinton, G. E. (2007). Unsupervised learning of image transformations. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’07). 688

Memisevic, R. and Hinton, G. E. (2010). Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6), 1473–1492. 688
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie, E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A., and Bergstra, J. (2011). Unsupervised and transfer learning challenge: a deep learning approach. In JMLR W&CP: Proc. Unsupervised and Transfer Learning, volume 7. 200, 535, 541

Mesnil, G., Rifai, S., Dauphin, Y., Bengio, Y., and Vincent, P. (2012). Surfing on the manifold. Learning Workshop, Snowbird. 713

Miikkulainen, R. and Dyer, M. G. (1991). Natural language processing with modular PDP networks and distributed lexicon. Cognitive Science, 15, 343–399. 480

Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology. 417

Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011a). Empirical evaluation and combination of advanced language modeling techniques. In Proc. 12th annual conference of the international speech communication association (INTERSPEECH 2011). 475

Mikolov, T., Deoras, A., Povey, D., Burget, L., and Cernocky, J. (2011b). Strategies for training large scale neural network language models. In Proc. ASRU’2011. 329, 475

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track. 539

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. Technical report, arXiv:1309.4168. 542

Minka, T. (2005). Divergence measures and message passing. Microsoft Research Cambridge UK Tech Rep MSRTR2005173, 72(TR-2005-173). 628

Minsky, M. L. and Papert, S. A. (1969). Perceptrons. MIT Press, Cambridge. 15

Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. 703

Mishkin, D. and Matas, J. (2015). All you need is a good init. arXiv preprint arXiv:1511.06422. 305

Misra, J. and Saha, I. (2010). Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing, 74(1), 239–255. 454

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York. 99

Miyato, T., Maeda, S., Koyama, M., Nakae, K., and Ishii, S. (2015). Distributional smoothing with virtual adversarial training. In ICLR. Preprint: arXiv:1507.00677. 268
BIBLIOGRAPHY
Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In ICML’2014. 693, 695

Mnih, A. and Hinton, G. E. (2007). Three new graphical models for statistical language modelling. In Z. Ghahramani, editor, Proceedings of the Twenty-fourth International Conference on Machine Learning (ICML’07), pages 641–648. ACM. 467

Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21 (NIPS’08), pages 1081–1088. 470

Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2265–2273. Curran Associates, Inc. 475, 625

Mnih, A. and Teh, Y. W. (2012). A fast and simple algorithm for training neural probabilistic language models. In ICML’2012, pages 1751–1758. 475

Mnih, V. and Hinton, G. (2010). Learning to detect roads in high-resolution aerial images. In Proceedings of the 11th European Conference on Computer Vision (ECCV). 102

Mnih, V., Larochelle, H., and Hinton, G. (2011). Conditional restricted Boltzmann machines for structure output prediction. In Proc. Conf. on Uncertainty in Artificial Intelligence (UAI). 687

Mnih, V., Kavukcuoglo, K., Silver, D., Graves, A., Antonoglou, I., and Wierstra, D. (2013). Playing Atari with deep reinforcement learning. Technical report, arXiv:1312.5602. 106

Mnih, V., Heess, N., Graves, A., and Kavukcuoglu, K. (2014). Recurrent models of visual attention. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, NIPS’2014, pages 2204–2212. 693

Mnih, V., Kavukcuoglo, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidgeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529–533. 25

Mobahi, H. and Fisher, III, J. W. (2015). A theoretical analysis of optimization by Gaussian continuation. In AAAI’2015. 328

Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In L. Bottou and M. Littman, editors, Proceedings of the 26th International Conference on Machine Learning, pages 737–744, Montreal. Omnipress. 497

Mohamed, A., Dahl, G., and Hinton, G. (2009). Deep belief networks for phone recognition. 462
Mohamed, A., Sainath, T. N., Dahl, G., Ramabhadran, B., Hinton, G. E., and Picheny, M. A. (2011). Deep belief networks using discriminative features for phone recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5060–5063. IEEE. 462

Mohamed, A., Dahl, G., and Hinton, G. (2012a). Acoustic modeling using deep belief networks. IEEE Trans. on Audio, Speech and Language Processing, 20(1), 14–22. 462

Mohamed, A., Hinton, G., and Penn, G. (2012b). Understanding how deep belief networks perform acoustic modelling. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4273–4276. IEEE. 462

Moller, M. F. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525–533. 316

Montavon, G. and Muller, K.-R. (2012). Deep Boltzmann machines and the centering trick. In G. Montavon, G. Orr, and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 621–637. Preprint: http://arxiv.org/abs/1203.3783. 675

Montúfar, G. (2014). Universal approximation depth and errors of narrow belief networks with discrete units. Neural Computation, 26. 556

Montúfar, G. and Ay, N. (2011). Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5), 1306–1319. 556

Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions of deep neural networks. In NIPS’2014. 19, 199

Mor-Yosef, S., Samueloff, A., Modan, B., Navot, D., and Schenker, J. G. (1990). Ranking the risk factors for cesarean: logistic regression analysis of a nationwide study. Obstet Gynecol, 75(6), 944–7. 3

Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language model. In AISTATS’2005. 470, 472

Mozer, M. C. (1992). The induction of multiscale temporal structure. In J. M. S. Hanson and R. Lippmann, editors, Advances in Neural Information Processing Systems 4 (NIPS’91), pages 275–282, San Mateo, CA. Morgan Kaufmann. 410

Murphy, K. P. (2012). Machine Learning: a Probabilistic Perspective. MIT Press, Cambridge, MA, USA. 62, 98, 145

Murray, B. U. I. and Larochelle, H. (2014). A deep and tractable density estimator. In ICML’2014. 189, 712

Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines. In ICML’2010. 16, 173, 196
Nair, V. and Hinton, G. E. (2009). 3d object recognition with deep belief nets. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1339–1347. Curran Associates, Inc. 688

Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis. In NIPS’2010. 163

Naumann, U. (2008). Optimal Jacobian accumulation is NP-complete. Mathematical Programming, 112(2), 427–441. 221

Navigli, R. and Velardi, P. (2005). Structural semantic interconnections: a knowledge-based approach to word sense disambiguation. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(7), 1075–1086. 487

Neal, R. and Hinton, G. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, Cambridge, MA. 637

Neal, R. M. (1990). Learning stochastic feedforward networks. Technical report. 694

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte-Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto. 682

Neal, R. M. (1994). Sampling from multimodal distributions using tempered transitions. Technical Report 9421, Dept. of Statistics, University of Toronto. 606

Neal, R. M. (1996). Bayesian Learning for Neural Networks. Lecture Notes in Statistics. Springer. 264

Neal, R. M. (2001). Annealed importance sampling. Statistics and Computing, 11(2), 125–139. 628, 630, 631, 632

Neal, R. M. (2005). Estimating ratios of normalizing constants using linked importance sampling. 632

Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27, 372–376. 300

Nesterov, Y. (2004). Introductory lectures on convex optimization: a basic course. Applied optimization. Kluwer Academic Publ., Boston, Dordrecht, London. 300

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. Deep Learning and Unsupervised Feature Learning Workshop, NIPS. 21

Ney, H. and Kneser, R. (1993). Improved clustering techniques for class-based statistical language modelling. In European Conference on Speech Communication and Technology (Eurospeech), pages 973–976, Berlin. 466
Ng, A. (2015). Advice for applying machine learning. https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf. 424

Niesler, T. R., Whittaker, E. W. D., and Woodland, P. C. (1998). Comparison of part-of-speech and automatically derived category-based language models for speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 177–180. 466

Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., and Barbano, P. E. (2005). Toward automatic phenotyping of developing embryos from videos. Image Processing, IEEE Transactions on, 14(9), 1360–1371. 361

Nocedal, J. and Wright, S. (2006). Numerical Optimization. Springer. 92, 95

Norouzi, M. and Fleet, D. J. (2011). Minimal loss hashing for compact binary codes. In ICML’2011. 528

Nowlan, S. J. (1990). Competing experts: An experimental investigation of associative mixture models. Technical Report CRG-TR-90-5, University of Toronto. 453

Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4), 473–493. 139

Olshausen, B. and Field, D. J. (2005). How close are we to understanding V1? Neural Computation, 17, 1665–1699. 16

Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. 146, 254, 371, 499

Olshausen, B. A., Anderson, C. H., and Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci., 13(11), 4700–4719. 453

Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural computation, 21(3), 786–792. 691

Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1717–1724. IEEE. 539

Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS’07), pages 1121–1128, Cambridge, MA. MIT Press. 635

Ovid and Martin, C. (2004). Metamorphoses. W.W. Norton. 1
Paccanaro, A. and Hinton, G. E. (2000). Extracting distributed representations of concepts and relations from positive and negative propositions. In International Joint Conference on Neural Networks (IJCNN), Como, Italy. IEEE, New York. 487

Paine, T. L., Khorrami, P., Han, W., and Huang, T. S. (2014). An analysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597. 535

Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot learning with semantic output codes. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1410–1418. Curran Associates, Inc. 542

Parker, D. B. (1985). Learning-logic. Technical Report TR-47, Center for Comp. Research in Economics and Management Sci., MIT. 225

Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural networks. In ICML’2013. 289, 403, 406, 410, 417, 419

Pascanu, R., Montufar, G., and Bengio, Y. (2013b). On the number of inference regions of deep feed forward networks with piece-wise linear activations. Technical report, U. Montreal, arXiv:1312.6098. 198

Pascanu, R., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014a). How to construct deep recurrent neural networks. In ICLR’2014. 19, 199, 264, 399, 400, 401, 413, 463

Pascanu, R., Montufar, G., and Bengio, Y. (2014b). On the number of inference regions of deep feed forward networks with piece-wise linear activations. In ICLR’2014. 553

Pati, Y., Rezaiifar, R., and Krishnaprasad, P. (1993). Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers, pages 40–44. 254

Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. In Proceedings of the 7th Conference of the Cognitive Science Society, University of California, Irvine, pages 329–334. 566

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann. 54

Perron, O. (1907). Zur theorie der matrices. Mathematische Annalen, 64(2), 248–263. 600

Petersen, K. B. and Pedersen, M. S. (2006). The matrix cookbook. Version 20051003. 31

Peterson, G. B. (2004). A day of great illumination: B. F. Skinner’s discovery of shaping. Journal of the Experimental Analysis of Behavior, 82(3), 317–328. 329

Pham, D.-T., Garat, P., and Jutten, C. (1992). Separation of a mixture of independent sources through a maximum likelihood approach. In EUSIPCO, pages 771–774. 494
Pham, P.-H., Jelaca, D., Farabet, C., Martini, B., LeCun, Y., and Culurciello, E. (2012). NeuFlow: dataflow vision processing system-on-a-chip. In Circuits and Systems (MWSCAS), 2012 IEEE 55th International Midwest Symposium on, pages 1044–1047. IEEE. 454

Pinheiro, P. H. O. and Collobert, R. (2014). Recurrent convolutional neural networks for scene labeling. In ICML’2014. 360

Pinheiro, P. H. O. and Collobert, R. (2015). From image-level to pixel-level labeling with convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR). 360

Pinto, N., Cox, D. D., and DiCarlo, J. J. (2008). Why is real-world visual object recognition hard? PLoS Comput Biol, 4. 459

Pinto, N., Stone, Z., Zickler, T., and Cox, D. (2011). Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 35–42. IEEE. 364

Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1), 77–105. 401

Polyak, B. and Juditsky, A. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30(4), 838–855. 323

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. 296

Poole, B., Sohl-Dickstein, J., and Ganguli, S. (2014). Analyzing noise in autoencoders and deep networks. CoRR, abs/1406.1831. 241

Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In Proceedings of the Twenty-seventh Conference in Uncertainty in Artificial Intelligence (UAI), Barcelona, Spain. 557

Presley, R. K. and Haggard, R. L. (1994). A fixed point implementation of the backpropagation learning algorithm. In Southeastcon’94. Creative Technology Transfer-A Global Affair., Proceedings of the 1994 IEEE, pages 136–138. IEEE. 454

Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IEEE Transactions on Information Theory, 4(2), 69–72. 691

Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. (2005). Invariant visual representation by single neurons in the human brain. Nature, 435(7045), 1102–1107. 367
BIBLIOGRAPHY
Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. 555, 703, 704

Raiko, T., Yao, L., Cho, K., and Bengio, Y. (2014). Iterative neural autoregressive distribution estimator (NADE-k). Technical report, arXiv:1406.1485. 678, 711

Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale deep unsupervised learning using graphics processors. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML'09), pages 873–880, New York, NY, USA. ACM. 27, 449

Ramsey, F. P. (1926). Truth and probability. In R. B. Braithwaite, editor, The Foundations of Mathematics and other Logical Essays, chapter 7, pages 156–198. McMaster University Archive for the History of Economic Thought. 56

Ranzato, M. and Hinton, G. H. (2010). Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR'2010, pages 2551–2558. 682

Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007a). Efficient learning of sparse representations with an energy-based model. In NIPS'2006. 14, 19, 510, 531, 533

Ranzato, M., Huang, F., Boureau, Y., and LeCun, Y. (2007b). Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR'07). IEEE Press. 365

Ranzato, M., Boureau, Y., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS'2007. 510

Ranzato, M., Krizhevsky, A., and Hinton, G. E. (2010a). Factored 3-way restricted Boltzmann machines for modeling natural images. In Proceedings of AISTATS 2010. 680, 681

Ranzato, M., Mnih, V., and Hinton, G. (2010b). Generating more realistic images using gated MRFs. In NIPS'2010. 682, 683

Rao, C. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–89. 135, 295

Rasmus, A., Valpola, H., Honkala, M., Berglund, M., and Raiko, T. (2015). Semi-supervised learning with ladder network. arXiv preprint arXiv:1507.02672. 429, 533

Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS'2011. 450

Reichert, D. P., Seriès, P., and Storkey, A. J. (2011). Neuronal adaptation for sampling-based probabilistic inference in perceptual bistability. In Advances in Neural Information Processing Systems, pages 2357–2365. 668
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML'2014. Preprint: arXiv:1401.4082. 655, 691, 698

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contractive auto-encoders: Explicit invariance during feature extraction. In ICML'2011. 524, 525, 526

Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. (2011b). Higher order contractive auto-encoder. In ECML PKDD. 524, 525

Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011c). The manifold tangent classifier. In NIPS'2011. 270, 271

Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive auto-encoders. In ICML'2012. 713

Ringach, D. and Shapley, R. (2004). Reverse correlation in neurophysiology. Cognitive Science, 28(2), 147–166. 369

Roberts, S. and Everson, R. (2001). Independent component analysis: principles and practice. Cambridge University Press. 496

Robinson, A. J. and Fallside, F. (1991). A recurrent error propagation network speech recognition system. Computer Speech and Language, 5(3), 259–274. 27, 462

Rockafellar, R. T. (1997). Convex Analysis. Princeton Landmarks in Mathematics. 93

Romero, A., Ballas, N., Ebrahimi Kahou, S., Chassang, A., Gatta, C., and Bengio, Y. (2015). FitNets: Hints for thin deep nets. In ICLR'2015, arXiv:1412.6550. 326

Rosen, J. B. (1960). The gradient projection method for nonlinear programming. Part I. Linear constraints. Journal of the Society for Industrial and Applied Mathematics, 8(1), pp. 181–217. 93

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408. 14, 15, 27

Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York. 15, 27

Roweis, S. and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500). 163, 521

Roweis, S., Saul, L., and Hinton, G. (2002). Global coordination of local linear models. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS'01), Cambridge, MA. MIT Press. 492

Rubin, D. B. et al. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4), 1151–1172. 718
Rumelhart, D., Hinton, G., and Williams, R. (1986a). Learning representations by back-propagating errors. Nature, 323, 533–536. 14, 18, 23, 203, 225, 374, 479, 485

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318–362. MIT Press, Cambridge. 21, 27, 225

Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986c). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge. 17

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014a). ImageNet Large Scale Visual Recognition Challenge. 21

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. (2014b). ImageNet large scale visual recognition challenge. arXiv preprint arXiv:1409.0575. 28

Russel, S. J. and Norvig, P. (2003). Artificial Intelligence: a Modern Approach. Prentice Hall. 86

Rust, N., Schwartz, O., Movshon, J. A., and Simoncelli, E. (2005). Spatiotemporal elements of macaque V1 receptive fields. Neuron, 46(6), 945–956. 368

Sainath, T., Mohamed, A., Kingsbury, B., and Ramabhadran, B. (2013). Deep convolutional neural networks for LVCSR. In ICASSP 2013. 463

Salakhutdinov, R. (2010). Learning in Markov random fields using tempered transitions. In Y. Bengio, D. Schuurmans, C. Williams, J. Lafferty, and A. Culotta, editors, Advances in Neural Information Processing Systems 22 (NIPS'09). 606

Salakhutdinov, R. and Hinton, G. (2009a). Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 5, pages 448–455. 24, 27, 532, 665, 668, 673, 674

Salakhutdinov, R. and Hinton, G. (2009b). Semantic hashing. In International Journal of Approximate Reasoning. 528

Salakhutdinov, R. and Hinton, G. E. (2007a). Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS'07), San Juan, Porto Rico. Omnipress. 530

Salakhutdinov, R. and Hinton, G. E. (2007b). Semantic hashing. In SIGIR'2007. 528
Salakhutdinov, R. and Hinton, G. E. (2008). Using deep belief nets to learn covariance kernels for Gaussian processes. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20 (NIPS'07), pages 1249–1256, Cambridge, MA. MIT Press. 244

Salakhutdinov, R. and Larochelle, H. (2010). Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), JMLR W&CP, volume 9, pages 693–700. 655

Salakhutdinov, R. and Mnih, A. (2008). Probabilistic matrix factorization. In NIPS'2008. 482

Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML'08), volume 25, pages 872–879. ACM. 631, 664

Salakhutdinov, R., Mnih, A., and Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML. 482

Sanger, T. D. (1994). Neural network learning control of robot manipulators using gradually increasing task difficulty. IEEE Transactions on Robotics and Automation, 10(3). 329

Saul, L. K. and Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS'95). MIT Press, Cambridge, MA. 641

Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76. 27, 695

Savich, A. W., Moussa, M., and Areibi, S. (2007). The impact of arithmetic representation on implementing MLP-BP on FPGAs: A study. Neural Networks, IEEE Transactions on, 18(1), 240–252. 454

Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., and Ng, A. (2011). On random weights and unsupervised feature learning. In Proc. ICML'2011. ACM. 364

Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR. 285, 286, 303

Schaul, T., Antonoglou, I., and Silver, D. (2014). Unit tests for stochastic optimization. In International Conference on Learning Representations. 309

Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2), 234–242. 401

Schmidhuber, J. (1996). Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1), 142–146. 480
Schmidhuber, J. (2012). Self-delimiting neural networks. arXiv preprint arXiv:1210.0118. 391

Schölkopf, B. and Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press. 705

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319. 163, 521

Schölkopf, B., Burges, C. J. C., and Smola, A. J. (1999). Advances in Kernel Methods — Support Vector Learning. MIT Press, Cambridge, MA. 18, 142

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On causal and anticausal learning. In ICML'2012, pages 1255–1262. 548

Schuster, M. (1999). On supervised learning from sequential data with applications for speech recognition. 189

Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681. 396

Schwenk, H. (2007). Continuous space language models. Computer Speech and Language, 21, 492–518. 469

Schwenk, H. (2010). Continuous space language models for statistical machine translation. The Prague Bulletin of Mathematical Linguistics, 93, 137–146. 476

Schwenk, H. (2014). Cleaned subset of WMT '14 dataset. 21

Schwenk, H. and Bengio, Y. (1998). Training methods for adaptive boosting of neural networks. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10 (NIPS'97), pages 647–653. MIT Press. 257

Schwenk, H. and Gauvain, J.-L. (2002). Connectionist language modeling for large vocabulary continuous speech recognition. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 765–768, Orlando, Florida. 469

Schwenk, H., Costa-jussà, M. R., and Fonollosa, J. A. R. (2006). Continuous space language models for the IWSLT 2006 task. In International Workshop on Spoken Language Translation, pages 166–173. 476

Seide, F., Li, G., and Yu, D. (2011). Conversational speech transcription using context-dependent deep neural networks. In Interspeech 2011, pages 437–440. 23

Sejnowski, T. (1987). Higher-order Boltzmann machines. In AIP Conference Proceedings 151 on Neural Networks for Computing, pages 398–403. American Institute of Physics Inc. 688
Series, P., Reichert, D. P., and Storkey, A. J. (2010). Hallucinations in Charles Bonnet syndrome induced by homeostasis: a deep Boltzmann machine model. In Advances in Neural Information Processing Systems, pages 2020–2028. 668

Sermanet, P., Chintala, S., and LeCun, Y. (2012). Convolutional neural networks applied to house numbers digit classification. CoRR, abs/1204.3968. 459

Sermanet, P., Kavukcuoglu, K., Chintala, S., and LeCun, Y. (2013). Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'13). IEEE. 23, 200

Shilov, G. (1977). Linear Algebra. Dover Books on Mathematics Series. Dover Publications. 31

Siegelmann, H. (1995). Computation beyond the Turing limit. Science, 268(5210), 545–548. 380

Siegelmann, H. and Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80. 380

Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets. Journal of Computer and Systems Sciences, 50(1), 132–150. 380, 406

Sietsma, J. and Dow, R. (1991). Creating artificial neural networks that generalize. Neural Networks, 4(1), 67–79. 241

Simard, D., Steinkraus, P. Y., and Platt, J. C. (2003). Best practices for convolutional neural networks. In ICDAR'2003. 372

Simard, P. and Graf, H. P. (1994). Backpropagation without multiplication. In Advances in Neural Information Processing Systems, pages 232–239. 454

Simard, P., Victorri, B., LeCun, Y., and Denker, J. (1992). Tangent prop - A formalism for specifying selected invariances in an adaptive network. In NIPS'1991. 269, 271, 357

Simard, P. Y., LeCun, Y., and Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In NIPS'92. 269

Simard, P. Y., LeCun, Y. A., Denker, J. S., and Victorri, B. (1998). Transformation invariance in pattern recognition — tangent distance and tangent propagation. Lecture Notes in Computer Science, 1524. 269

Simons, D. J. and Levin, D. T. (1998). Failure to detect changes to people during a real-world interaction. Psychonomic Bulletin & Review, 5(4), 644–649. 546

Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR. 324
BIBLIOGRAPHY
Sjöberg, J. and Ljung, L. (1995). Overtraining, regularization and searching for a minimum, with application to neural networks. International Journal of Control, 62(6), 1391–1407. 249

Skinner, B. F. (1958). Reinforcement today. American Psychologist, 13, 94–99. 329

Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge. 574, 590, 658

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In NIPS’2012. 439

Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011a). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In NIPS’2011. 401, 403

Socher, R., Manning, C., and Ng, A. Y. (2011b).
Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the Twenty-Eighth International Conference on Machine Learning (ICML’2011). 401

Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. (2011c). Semi-supervised recursive autoencoders for predicting sentiment distributions. In EMNLP’2011. 401

Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. (2013a). Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP’2013. 401, 403

Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. Y. (2013b). Zero-shot learning through cross-modal transfer. In 27th Annual Conference on Neural Information Processing Systems (NIPS 2013). 542

Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics.
717, 718

Sohn, K., Zhou, G., and Lee, H. (2013). Learning and selecting features jointly with point-wise gated Boltzmann machines. In ICML’2013. 689

Solomonoff, R. J. (1989). A system for incremental learning based on algorithmic probability. 329

Sontag, E. D. (1998). VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences, 168, 69–96. 550, 554

Sontag, E. D. and Sussman, H. J. (1989). Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3, 91–106. 284
Sparkes, B. (1996). The Red and the Black: Studies in Greek Pottery. Routledge. 1

Spitkovsky, V. I., Alshawi, H., and Jurafsky, D. (2010). From baby steps to leapfrog: how “less is more” in unsupervised dependency parsing. In HLT’10. 329

Squire, W. and Trapp, G. (1998). Using complex variables to estimate derivatives of real functions. SIAM Rev., 40(1), 110–112. 442

Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pages 545–560. Springer-Verlag. 239

Srivastava, N. (2013). Improving Neural Networks With Dropout. Master’s thesis, U. Toronto. 538

Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann machines. In NIPS’2012. 544

Srivastava, N., Salakhutdinov, R. R., and Hinton, G. E. (2013). Modeling documents with deep Boltzmann machines.
arXiv preprint arXiv:1309.6865. 665

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15, 1929–1958. 257, 263, 264, 265, 674

Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Highway networks. arXiv:1505.00387. 327

Steinkrau, D., Simard, P. Y., and Buck, I. (2005). Using GPUs for machine learning algorithms. 2013 12th International Conference on Document Analysis and Recognition, 0, 1115–1119. 448

Stoyanov, V., Ropson, A., and Eisner, J. (2011). Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In
In (AIST (AISTA TS) TS),, volumegiv 15en of approximate JMLR Workshop and Confer Conferenc ence e Pr Pro oceemo dings dings, 725–733, Pr oceLauderdale. edings of the Supplemen 14th International Confer(4 enc e on Artificial Intel ligenc and Statistics F ort Supplementary tary material pages) also av available. ailable. 676,e 700 (AISTATS), volume 15 of JMLR Workshop and Conference Proceedings , pages 725–733, Sukh Sukhbaatar, baatar, S., Szlam, A., Weston, and Fergus, R. (2015). eakly sup supervised Fort Lauderdale. Supplemen tary J., material (4 pages) also avW ailable. 676ervised , 700 memory net netw works. arXiv pr preprint eprint arXiv:1503.08895 . 421 Sukhbaatar, S., Szlam, A., Weston, J., and Fergus, R. (2015). Weakly supervised memory preprint arXiv:1503.08895 Supancic, J. arXiv and Ramanan, D. (2013). Self-paced tracking. king. In networks. . 421 learning for long-term trac CVPR’2013 . 329 Supancic, J. and Ramanan, D. (2013). Self-paced learning for long-term tracking. In CVPR’2013 Sussillo, D. (2014). feed-forward ard net netw works . 329 Random walks: Training very deep nonlinear feed-forw with smart initialization. CoRR, abs/1412.6558. 290, 303, 305, 405 Sussillo, D. (2014). Random walks: Training very deep nonlinear feed-forward networks CoRR Sutsk Sutskev ev ever, er, I. (2012). Training Recurr current ent Neur Neural al Networks Networks. Ph.D. with smart initialization. , abs/1412.6558 . 290, .303 , 305thesis, , 405 Department of computer science, Univ Universit ersit ersity y of Toronto. 408, 415 Sutskever, I. (2012). Training Recurrent Neural Networks . Ph.D. thesis, Department of computer science, University of Toronto.772 408, 415
Sutskever, I. and Hinton, G. E. (2008). Deep narrow sigmoid belief networks are universal approximators. Neural Computation, 20(11), 2629–2636. 695

Sutskever, I. and Tieleman, T. (2010). On the Convergence Properties of Contrastive Divergence. In Y. W. Teh and M. Titterington, editors, Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 9, pages 789–795. 615

Sutskever, I., Hinton, G., and Taylor, G. (2009). The recurrent temporal restricted Boltzmann machine. In NIPS’2008. 688

Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent neural networks. In ICML’2011, pages 1017–1024. 480

Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In ICML. 300, 408, 415

Sutskever, I., Vinyals, O., and Le, Q. V.
(2014). Sequence to sequence learning with neural networks. In NIPS’2014, arXiv:1409.3215. 25, 101, 397, 411, 414, 477, 478

Sutton, R. and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press. 106

Sutton, R. S., Mcallester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In NIPS’1999, pages 1057–1063. MIT Press. 693

Swersky, K., Ranzato, M., Buchman, D., Marlin, B., and de Freitas, N. (2011). On autoencoders and score matching for energy based models. In ICML’2011. ACM. 516

Swersky, K., Snoek, J., and Adams, R. P. (2014). Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896. 439

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014a). Going deeper with convolutions. Technical report, arXiv:1409.4842.
24, 27, 200, 257, 267, 327, 348

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I. J., and Fergus, R. (2014b). Intriguing properties of neural networks. ICLR, abs/1312.6199. 267, 268, 270

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. ArXiv e-prints. 244, 323

Taigman, Y., Yang, M., Ranzato, M., and Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. In CVPR’2014. 100

Tandy, D. W. (1997). Works and Days: A Translation and Commentary for the Social Sciences. University of California Press. 1
Tang, Y. and Eliasmith, C. (2010). Deep networks for robust visual recognition. In Proceedings of the 27th International Conference on Machine Learning, June 21-24, 2010, Haifa, Israel. 241

Tang, Y., Salakhutdinov, R., and Hinton, G. (2012). Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635. 492

Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML’09), pages 1025–1032, Montreal, Quebec, Canada. ACM. 688

Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS’06), pages 1345–1352.
MIT Press, Cambridge, MA. 687

Teh, Y., Welling, M., Osindero, S., and Hinton, G. E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research, 4, 1235–1260. 494

Tenenbaum, J., de Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. 163, 521, 536

Theis, L., van den Oord, A., and Bethge, M. (2015). A note on the evaluation of generative models. arXiv:1511.01844. 699, 721

Thompson, J., Jain, A., LeCun, Y., and Bregler, C. (2014). Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS’2014. 361

Thrun, S. (1995). Learning to play the game of chess. In NIPS’1994. 269

Tibshirani, R. J. (1995). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288. 236

Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to
the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08), pages 1064–1071. ACM. 615

Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-sixth International Conference on Machine Learning (ICML’09), pages 1033–1040. ACM. 617

Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal components analysis. Journal of the Royal Statistical Society B, 61(3), 611–622. 494
Torralba, A., Fergus, R., and Weiss, Y. (2008). Small codes and large databases for recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’08), pages 1–8. 528

Touretzky, D. S. and Minton, G. E. (1985). Symbols among the neurons: Details of a connectionist inference architecture. In Proceedings of the 9th International Joint Conference on Artificial Intelligence - Volume 1, IJCAI’85, pages 238–243, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. 17

Tu, K. and Honavar, V. (2011). On the utility of curricula in unsupervised learning of probabilistic grammars. In IJCAI’2011. 329

Turaga, S. C., Murray, J. F., Jain, V., Roth, F., Helmstaedter, M., Briggman, K., Denk, W., and Seung, H. S. (2010). Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Computation, 22(2), 511–538. 360

Turian, J., Ratinov, L., and Bengio, Y. (2010). Word representations: A simple and
general method for semi-supervised learning. In Proc. ACL’2010, pages 384–394. 538

Töscher, A., Jahrer, M., and Bell, R. M. (2009). The BigChaos solution to the Netflix grand prize. 482

Uria, B., Murray, I., and Larochelle, H. (2013). RNADE: The real-valued neural autoregressive density-estimator. In NIPS’2013. 711

van den Oörd, A., Dieleman, S., and Schrauwen, B. (2013). Deep content-based music recommendation. In NIPS’2013. 483

van der Maaten, L. and Hinton, G. E. (2008). Visualizing data using t-SNE. J. Machine Learning Res., 9. 480, 522

Vanhoucke, V., Senior, A., and Mao, M. Z. (2011). Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop. 447, 455

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin. 114

Vapnik, V. N. (1995). The
Nature of Statistical Learning Theory. Springer, New York. 114

Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280. 114

Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural Computation, 23(7). 516, 518, 714
Vincent, P. and Bengio, Y. (2003). Manifold Parzen windows. In NIPS’2002. MIT Press. 523

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In ICML 2008. 241, 518

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Machine Learning Res., 11. 518

Vincent, P., de Brébisson, A., and Bouthillier, X. (2015). Efficient exact gradient update for training deep networks with very large sparse targets. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1108–1116. Curran Associates, Inc. 468

Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2014a).
Grammar as a foreign language. Technical report, arXiv:1412.7449. 411

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2014b). Show and tell: a neural image caption generator. arXiv 1411.4555. 411

Vinyals, O., Fortunato, M., and Jaitly, N. (2015a). Pointer networks. arXiv preprint arXiv:1506.03134. 421

Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015b). Show and tell: a neural image caption generator. In CVPR’2015. arXiv:1411.4555. 102

Viola, P. and Jones, M. (2001). Robust real-time object detection. In International Journal of Computer Vision. 452

Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., and Bengio, Y. (2015). ReNet: A recurrent neural network based alternative to convolutional networks. arXiv preprint arXiv:1505.00393. 397

Von Melchner, L., Pallas, S. L., and Sur, M. (2000). Visual behaviour mediated by retinal projections directed to the auditory pathway. Nature, 404(6780), 871–876. 16
Wager, S., Wang, S., and Liang, P. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351–359. 264

Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37, 328–339. 375, 456, 462

Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In ICML’2013. 265

Wang, S. and Manning, C. (2013). Fast dropout training. In ICML’2013. 264
BIBLIOGRAPHY
Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014a). Knowledge graph and text jointly embedding. In Proc. EMNLP’2014. 487

Wang, Z., Zhang, J., Feng, J., and Chen, Z. (2014b). Knowledge graph embedding by translating on hyperplanes. In Proc. AAAI’2014. 487

Warde-Farley, D., Goodfellow, I. J., Courville, A., and Bengio, Y. (2014). An empirical analysis of dropout in piecewise linear networks. In ICLR’2014. 261, 265, 266

Wawrzynek, J., Asanovic, K., Kingsbury, B., Johnson, D., Beck, J., and Morgan, N. (1996). Spert-II: A vector microprocessor system. Computer, 29(3), 79–86. 454

Weaver, L. and Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement learning. In Proc. UAI’2001, pages 538–545. 693

Weinberger, K. Q. and Saul, L. K. (2004). Unsupervised learning of image manifolds by semidefinite programming. In CVPR’2004, pages 988–995. 163, 522

Weiss, Y., Torralba, A., and Fergus, R. (2008). Spectral hashing. In NIPS, pages 1753–1760. 528

Welling, M., Zemel, R. S., and Hinton, G. E. (2002). Self supervised boosting. In Advances in Neural Information Processing Systems, pages 665–672. 705

Welling, M., Hinton, G. E., and Osindero, S. (2003a). Learning sparse topographic representations with products of Student-t distributions. In NIPS’2002. 682

Welling, M., Zemel, R., and Hinton, G. E. (2003b). Self-supervised boosting. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15 (NIPS’02), pages 665–672. MIT Press. 626

Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17 (NIPS’04), volume 17, Cambridge, MA. MIT Press. 678

Werbos, P. J. (1981). Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770. 225

Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1), 21–35. 403

Weston, J., Chopra, S., and Bordes, A. (2014). Memory networks. arXiv preprint arXiv:1410.3916. 421, 488

Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record, volume 4, pages 96–104. IRE, New York. 15, 21, 24, 27
Wikipedia (2015). List of animals by number of neurons — Wikipedia, the free encyclopedia. [Online; accessed 4-March-2015]. 24, 27

Williams, C. K. I. and Agakov, F. V. (2002). Products of Gaussians and Probabilistic Minor Component Analysis. Neural Computation, 14(5), 1169–1182. 684

Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems 8 (NIPS’95), pages 514–520. MIT Press, Cambridge, MA. 142

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256. 690, 691

Williams, R. J. and Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280. 222

Wilson, D. R. and Martinez, T. R. (2003). The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10), 1429–1451. 279

Wilson, J. R. (1984). Variance reduction techniques for digital simulation. American Journal of Mathematical and Management Sciences, 4(3), 277–312. 692

Wiskott, L. and Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770. 497

Wolpert, D. and MacReady, W. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1, 67–82. 293

Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390. 116

Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image recognition. arXiv:1501.02876. 450

Wu, Z. (1997). Global continuation for distance geometry problems. SIAM Journal of Optimization, 7, 814–836. 328

Xiong, H. Y., Barash, Y., and Frey, B. J. (2011). Bayesian prediction of tissue-regulated splicing using RNA sequence and cellular context. Bioinformatics, 27(18), 2554–2562. 264

Xu, K., Ba, J. L., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In ICML’2015, arXiv:1502.03044. 102, 411, 693

Yildiz, I. B., Jaeger, H., and Kiebel, S. J. (2012). Re-visiting the echo state property. Neural networks, 35, 1–9. 407
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In NIPS’2014. 324, 539

Younes, L. (1998). On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. In Stochastics and Stochastics Models, pages 177–228. 615

Yu, D., Wang, S., and Deng, L. (2010). Sequential labeling using deep-structured conditional random fields. IEEE Journal of Selected Topics in Signal Processing. 324

Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv 1410.4615. 330

Zaremba, W. and Sutskever, I. (2015). Reinforcement learning neural Turing machines. arXiv:1505.00521. 422

Zaslavsky, T. (1975). Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes. Number no. 154 in Memoirs of the American Mathematical Society. American Mathematical Society. 553

Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV’14. 6

Zeiler, M. D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., and Hinton, G. E. (2013). On rectified linear units for speech processing. In ICASSP 2013. 462

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2015). Object detectors emerge in deep scene CNNs. ICLR’2015, arXiv:1412.6856. 554

Zhou, J. and Troyanskaya, O. G. (2014). Deep supervised and convolutional generative stochastic network for protein secondary structure prediction. In ICML’2014. 717

Zhou, Y. and Chellappa, R. (1988). Computation of optical flow using a neural network. In Neural Networks, 1988., IEEE International Conference on, pages 71–78. IEEE. 340

Zöhrer, M. and Pernkopf, F. (2014). General stochastic networks for classification. In NIPS’2014. 717
Index
0-1 loss, 104, 276
Absolute value rectification, 192
Accuracy, 426
Activation function, 170
Active constraint, 95
AdaGrad, 307
ADALINE, see adaptive linear element
Adam, 308, 428
Adaptive linear element, 15, 24, 27
Adversarial example, 267
Adversarial training, 268, 270, 533
Affine, 110
AIS, see annealed importance sampling
Almost everywhere, 71
Almost sure convergence, 130
Ancestral sampling, 583, 598
ANN, see Artificial neural network
Annealed importance sampling, 628, 670, 719
Approximate Bayesian computation, 718
Approximate inference, 586
Artificial intelligence, 1
Artificial neural network, see Neural network
ASR, see automatic speech recognition
Asymptotically unbiased, 124
Audio, 102, 361, 461
Autoencoder, 4, 357, 505
Automatic speech recognition, 461
Back-propagation, 203
Back-propagation through time, 385
Backprop, see back-propagation
Bag of words, 474
Bagging, 255
Batch normalization, 266, 428
Bayes error, 117
Bayes’ rule, 70
Bayesian hyperparameter optimization, 439
Bayesian network, see directed graphical model
Bayesian probability, 55
Bayesian statistics, 135
Belief network, see directed graphical model
Bernoulli distribution, 62
BFGS, 316
Bias, 124, 229
Bias parameter, 110
Biased importance sampling, 596
Bigram, 465
Binary relation, 485
Block Gibbs sampling, 602
Boltzmann distribution, 573
Boltzmann machine, 573, 656
BPTT, see back-propagation through time
Broadcasting, 34
Burn-in, 600
CAE, see contractive autoencoder
Calculus of variations, 179
Categorical distribution, see multinoulli distribution
CD, see contrastive divergence
Centering trick (DBM), 675
Central limit theorem, 63
Chain rule (calculus), 206
Chain rule of probability, 59
Chess, 2
Chord, 582
Chordal graph, 582
Class-based language models, 466
Classical dynamical system, 376
Classification, 100
Clique potential, see factor (graphical model)
CNN, see convolutional neural network
Collaborative Filtering, 481
Collider, see explaining away
Color images, 361
Complex cell, 366
Computational graph, 204
Computer vision, 455
Concept drift, 541
Condition number, 279
Conditional computation, see dynamic structure
Conditional independence, xiii, 60
Conditional probability, 59
Conditional RBM, 687
Connectionism, 17, 446
Connectionist temporal classification, 463
Consistency, 130, 516
Constrained optimization, 93, 237
Content-based addressing, 422
Content-based recommender systems, 483
Context-specific independence, 576
Contextual bandits, 483
Continuation methods, 328
Contractive autoencoder, 524
Contrast, 457
Contrastive divergence, 291, 613, 674
Convex optimization, 141
Convolution, 331, 685
Convolutional network, 16
Convolutional neural network, 252, 331, 428, 463
Coordinate descent, 322, 673
Correlation, 61
Cost function, see objective function
Covariance, xiii, 61
Covariance matrix, 62
Coverage, 427
Critical temperature, 606
Cross-correlation, 333
Cross-entropy, 75, 132
Cross-validation, 122
CTC, see connectionist temporal classification
Curriculum learning, 329
Curse of dimensionality, 154
Cyc, 2
D-separation, 575
DAE, see denoising autoencoder
Data generating distribution, 111, 131
Data generating process, 111
Data parallelism, 450
Dataset, 105
Dataset augmentation, 270, 460
DBM, see deep Boltzmann machine
DCGAN, 554, 555, 703
Decision tree, 145, 551
Decoder, 4
Deep belief network, 27, 532, 634, 659, 662, 686, 694
Deep Blue, 2
Deep Boltzmann machine, 24, 27, 532, 634, 655, 659, 665, 674, 686
Deep feedforward network, 167, 428
Deep learning, 2, 5
Denoising autoencoder, 513, 691
Denoising score matching, 622
Density estimation, 103
Derivative, xiii, 83
Design matrix, 106
Detector layer, 340
Determinant, xii
Diagonal matrix, 41
Differential entropy, 74, 649
Dirac delta function, 65
Directed graphical model, 77, 510, 566, 694
Directional derivative, 85
Discriminative fine-tuning, see supervised fine-tuning
Discriminative RBM, 688
Distributed representation, 17, 150, 549
Domain adaptation, 539
Dot product, 34, 141
Double backprop, 270
Doubly block circulant matrix, 334
Dream sleep, 612, 655
DropConnect, 265
Dropout, 257, 428, 433, 434, 674, 691
Dynamic structure, 451, 452
E-step, 637
Early stopping, 246, 249, 272, 273, 428
EBM, see energy-based model
Echo state network, 24, 27, 406
Effective capacity, 114
Eigendecomposition, 42
Eigenvalue, 42
Eigenvector, 42
ELBO, see evidence lower bound
Element-wise product, see Hadamard product
EM, see expectation maximization
Embedding, 519
Empirical distribution, 66
Empirical risk, 276
Empirical risk minimization, 276
Encoder, 4
Energy function, 572
Energy-based model, 572, 598, 656, 665
Ensemble methods, 255
Epoch, 247
Equality constraint, 94
Equivariance, 339
Error function, see objective function
ESN, see echo state network
Euclidean norm, 39
Euler-Lagrange equation, 649
Evidence lower bound, 636, 663
Example, 99
Expectation, 60
Expectation maximization, 637
Expected value, see expectation
Explaining away, 577, 634, 647
Exploitation, 484
Exploration, 484
Exponential distribution, 65
F-score, 426
Factor (graphical model), 570
Factor analysis, 493
Factor graph, 582
Factors of variation, 4
Feature, 99
Feature selection, 236
Feedforward neural network, 167
Fine-tuning, 324
Finite differences, 442
Forget gate, 306
Forward propagation, 203
Fourier transform, 361, 363
Fovea, 367
FPCD, 617
Free energy, 574, 682
Freebase, 486
Frequentist probability, 55
Frequentist statistics, 135
Frobenius norm, 46
Fully-visible Bayes network, 707
Functional derivatives, 648
FVBN, see fully-visible Bayes network
Gabor function, 369
GANs, see generative adversarial networks
Gated recurrent unit, 428
Gaussian distribution, see normal distribution
Gaussian kernel, 142
Gaussian mixture, 67, 188
GCN, see global contrast normalization
GeneOntology, 486
Generalization, 110
Generalized Lagrange function, see generalized Lagrangian
Generalized Lagrangian, 94
Generative adversarial networks, 691, 702
Generative moment matching networks, 705
Generator network, 695
Gibbs distribution, 571
Gibbs sampling, 584, 602
Global contrast normalization, 457
GPU, see graphics processing unit
Gradient, 84
Gradient clipping, 289, 417
Gradient descent, 83, 85
Graph, xii
Graphical model, see structured probabilistic model
Graphics processing unit, 447
Greedy algorithm, 324
Greedy layer-wise unsupervised pretraining, 531
Greedy supervised pretraining, 324
Grid search, 435
Hadamard product, xii, 34
Hard tanh, 196
Harmonium, see restricted Boltzmann machine
Harmony theory, 574
Helmholtz free energy, see evidence lower bound
Hessian, 223
Hessian matrix, xiii, 87
Heteroscedastic, 187
Hidden layer, 6, 167
Hill climbing, 86
Hyperparameter optimization, 435
Hyperparameters, 120, 433
Hypothesis space, 112, 118
i.i.d. assumptions, 111, 122, 267
Identity matrix, 36
ILSVRC, see ImageNet Large-Scale Visual Recognition Challenge
ImageNet Large-Scale Visual Recognition Challenge, 23
Immorality, 580
Importance sampling, 595, 627, 700
Importance weighted autoencoder, 700
Independence, xiii, 60
Independent and identically distributed, see i.i.d. assumptions
Independent component analysis, 494
Independent subspace analysis, 496
Inequality constraint, 94
Inference, 565, 586, 634, 636, 638, 641, 651, 653
Information retrieval, 528
Initialization, 301
Integral, xiii
Invariance, 343
Isotropic, 65
Jacobian matrix, xiii, 72, 86
Joint probability, 57
k-means, 365, 549
k-nearest neighbors, 143, 551
Karush-Kuhn-Tucker conditions, 95, 237
Karush–Kuhn–Tucker, 94
Kernel (convolution), 332, 333
Kernel machine, 551
Kernel trick, 141
KKT, see Karush–Kuhn–Tucker conditions
KKT conditions, see Karush-Kuhn-Tucker conditions
KL divergence, see Kullback-Leibler divergence
Knowledge base, 2, 486
Krylov methods, 224
Kullback-Leibler divergence, xiii, 74
Label smoothing, 243
Lagrange multipliers, 94, 649
Lagrangian, see generalized Lagrangian
LAPGAN, 704
Laplace distribution, 65, 499, 500
Latent variable, 67
Layer (neural network), 167
LCN, see local contrast normalization
Leaky ReLU, 192
Leaky units, 409
Learning rate, 85
Line search, 85, 86, 93
Linear combination, 37
Linear dependence, 38
Linear factor models, 492
Linear regression, 107, 110, 140
Link prediction, 487
Lipschitz constant, 92
Lipschitz continuous, 92
Liquid state machine, 406
INDEX
Local conditional probability distribution, 567
Local contrast normalization, 459
Logistic regression, 3, 140
Logistic sigmoid, 7, 67
Long short-term memory, 18, 25, 306, 411, 428
Loop, 582
Loopy belief propagation, 588
Loss function, see objective function
L^p norm, 39
LSTM, see long short-term memory
M-step, 637
Machine learning, 2
Machine translation, 101
Main diagonal, 33
Manifold, 160
Manifold hypothesis, 161
Manifold learning, 161
Manifold tangent classifier, 270
MAP approximation, 138, 508
Marginal probability, 58
Markov chain, 598
Markov chain Monte Carlo, 598
Markov network, see undirected model
Markov random field, see undirected model
Matrix, xi, xii, 32
Matrix inverse, 36
Matrix product, xii, 34
Max norm, 40
Max pooling, 340
Maximum likelihood, 131
Maxout, 192, 428
MCMC, see Markov chain Monte Carlo
Mean field, 641, 642, 674
Mean squared error, 108
Measure theory, 71
Measure zero, 71
Memory network, 419, 421
Method of steepest descent, see gradient descent
Minibatch, 279
Missing inputs, 100
Mixing (Markov chain), 604
Mixture density networks, 188
Mixture distribution, 66
Mixture model, 188, 513
Mixture of experts, 453, 551
MLP, see multilayer perceptron
MNIST, 21, 22, 674
Model averaging, 255
Model compression, 451
Model identifiability, 284
Model parallelism, 450
Moment matching, 705
Moore-Penrose pseudoinverse, 45, 240
Moralized graph, 580
MP-DBM, see multi-prediction DBM
MRF (Markov Random Field), see undirected model
MSE, see mean squared error
Multi-modal learning, 542
Multi-prediction DBM, 676
Multi-task learning, 245, 541
Multilayer perceptron, 5, 27
Multinomial distribution, 62
Multinoulli distribution, 62
n-gram, 464
NADE, 710
Naive Bayes, 3
Nat, 73
Natural image, 562
Natural language processing, 464
Nearest neighbor regression, 115
Negative definite, 89
Negative phase, 473, 609, 611
Neocognitron, 16, 24, 27, 368
Nesterov momentum, 300
Netflix Grand Prize, 256, 482
Neural language model, 466, 479
Neural network, 13
Neural Turing machine, 421
Neuroscience, 15
Newton’s method, 89, 310
NLM, see neural language model
NLP, see natural language processing
No free lunch theorem, 116
Noise-contrastive estimation, 623
Non-parametric model, 114
Norm, xiv, 39
Normal distribution, 63, 64, 125
Normal equations, 109, 112, 234
Normalized initialization, 303
Numerical differentiation, see finite differences
Object detection, 456
Object recognition, 456
Objective function, 82
OMP-k, see orthogonal matching pursuit
One-shot learning, 541
Operation, 204
Optimization, 80, 82
Orthodox statistics, see frequentist statistics
Orthogonal matching pursuit, 27, 254
Orthogonal matrix, 42
Orthogonality, 41
Output layer, 167
Parallel distributed processing, 17
Parameter initialization, 301, 408
Parameter sharing, 251, 336, 374, 376, 389
Parameter tying, see parameter sharing
Parametric model, 114
Parametric ReLU, 192
Partial derivative, 84
Partition function, 571, 608, 671
PCA, see principal components analysis
PCD, see stochastic maximum likelihood
Perceptron, 15, 27
Persistent contrastive divergence, see stochastic maximum likelihood
Perturbation analysis, see reparametrization trick
Point estimator, 122
Policy, 483
Pooling, 331, 685
Positive definite, 89
Positive phase, 473, 609, 611, 658, 670
Precision, 426
Precision (of a normal distribution), 63, 65
Predictive sparse decomposition, 526
Preprocessing, 456
Pretraining, 324, 531
Primary visual cortex, 366
Principal components analysis, 48, 146–148, 493, 634
Prior probability distribution, 135
Probabilistic max pooling, 685
Probabilistic PCA, 493, 494, 635
Probability density function, 58
Probability distribution, 56
Probability mass function, 56
Probability mass function estimation, 103
Product of experts, 573
Product rule of probability, see chain rule of probability
PSD, see predictive sparse decomposition
Pseudolikelihood, 618
Quadrature pair, 370
Quasi-Newton condition, 316
Quasi-Newton methods, 316
Radial basis function, 196
Random search, 437
Random variable, 56
Ratio matching, 621
RBF, 196
RBM, see restricted Boltzmann machine
Recall, 426
Receptive field, 338
Recommender systems, 481
Rectified linear unit, 171, 192, 428, 510
Recurrent network, 27
Recurrent neural network, 379
Regression, 101
Regularization, 120, 177, 228, 433
Regularizer, 119
REINFORCE, 691
Reinforcement learning, 25, 106, 483, 691
Relational database, 486
Reparametrization trick, 690
Representation learning, 3
Representational capacity, 114
Restricted Boltzmann machine, 357, 462, 482, 590, 634, 658, 659, 674, 678, 680, 683, 685
Ridge regression, see weight decay
Risk, 275
RNN-RBM, 688
Saddle points, 285
Sample mean, 125
Scalar, xi, xii, 31
Score matching, 516, 620
Secant condition, 316
Second derivative, 86
Second derivative test, 89
Self-information, 73
Semantic hashing, 528
Semi-supervised learning, 244
Separable convolution, 363
Separation (probabilistic modeling), 575
Set, xii
SGD, see stochastic gradient descent
Shannon entropy, xiii, 73
Shortlist, 469
Sigmoid, xiv, see logistic sigmoid
Sigmoid belief network, 27
Simple cell, 366
Singular value, see singular value decomposition
Singular value decomposition, 44, 148, 482
Singular vector, see singular value decomposition
Slow feature analysis, 496
SML, see stochastic maximum likelihood
Softmax, 183, 421, 453
Softplus, xiv, 68, 196
Spam detection, 3
Sparse coding, 322, 357, 499, 634, 694
Sparse initialization, 304, 408
Sparse representation, 146, 226, 253, 508, 559
Spearmint, 439
Spectral radius, 407
Speech recognition, see automatic speech recognition
Sphering, see whitening
Spike and slab restricted Boltzmann machine, 683
SPN, see sum-product network
Square matrix, 38
ssRBM, see spike and slab restricted Boltzmann machine
Standard deviation, 61
Standard error, 127
Standard error of the mean, 128, 278
Statistic, 122
Statistical learning theory, 110
Steepest descent, see gradient descent
Stochastic back-propagation, see reparametrization trick
Stochastic gradient descent, 15, 150, 279, 294, 674
Stochastic maximum likelihood, 615, 674
Stochastic pooling, 265
Structure learning, 585
Structured output, 101, 687
Structured probabilistic model, 77, 561
Sum rule of probability, 58
Sum-product network, 556
Supervised fine-tuning, 532, 664
Supervised learning, 105
Support vector machine, 140
Surrogate loss function, 276
SVD, see singular value decomposition
Symmetric matrix, 41, 43
Tangent distance, 269
Tangent plane, 519
Tangent prop, 269
TDNN, see time-delay neural network
Teacher forcing, 383, 384
Tempering, 606
Template matching, 141
Tensor, xi, xii, 33
Test set, 110
Tikhonov regularization, see weight decay
Tiled convolution, 353
Time-delay neural network, 369, 375
Toeplitz matrix, 334
Topographic ICA, 496
Trace operator, 46
Training error, 110
Transcription, 101
Transfer learning, 539
Transpose, xii, 33
Triangle inequality, 39
Triangulated graph, see chordal graph
Trigram, 465
Unbiased, 124
Undirected graphical model, 77, 510
Undirected model, 569
Uniform distribution, 57
Unigram, 465
Unit norm, 41
Unit vector, 41
Universal approximation theorem, 197
Universal approximator, 556
Unnormalized probability distribution, 570
Unsupervised learning, 105, 146
Unsupervised pretraining, 462, 531
V-structure, see explaining away
V1, 366
VAE, see variational autoencoder
Vapnik-Chervonenkis dimension, 114
Variance, xiii, 61, 229
Variational autoencoder, 691, 698
Variational derivatives, see functional derivatives
Variational free energy, see evidence lower bound
VC dimension, see Vapnik-Chervonenkis dimension
Vector, xi, xii, 32
Virtual adversarial examples, 268
Visible layer, 6
Volumetric data, 361
Wake-sleep, 654, 663
Weight decay, 118, 177, 231, 434
Weight space symmetry, 284
Weights, 15, 107
Whitening, 458
Wikibase, 486
Word embedding, 467
Word-sense disambiguation, 487
WordNet, 486
Zero-data learning, see zero-shot learning
Zero-shot learning, 541