Lecture 7 | Training Neural Networks II
1
00:00:00,485 --> 00:00:05,515
2
00:00:08,355 --> 00:00:11,357
- Okay, it's after 12, so I think
we should get started.
3
00:00:14,644 --> 00:00:17,419
Today we're going to kind of pick up
where we left off last time.
4
00:00:17,419 --> 00:00:23,400
Last time we talked about a lot of sort of tips and tricks
involved in the nitty gritty details of training neural networks.
5
00:00:23,400 --> 00:00:30,439
Today we'll pick up where we left off, and talk about a lot more
of these sort of nitty gritty details about training these things.
6
00:00:30,439 --> 00:00:34,707
As usual, a couple administrative notes
before we get into the material.
7
00:00:34,707 --> 00:00:39,645
As you all know, assignment one is already
due. Hopefully you all turned it in.
8
00:00:39,645 --> 00:00:57,322
Did it go okay? Was it not okay? Rough sentiment? Mostly okay. Okay, that's good. Awesome. [laughs] We're in
the process of grading those, so stay tuned. We're hoping to get grades back for those before A2 is due.
9
00:00:57,322 --> 00:01:04,121
Another reminder, that your project proposals
are due tomorrow. Actually, no, today at 11:59.
10
00:01:04,959 --> 00:01:09,074
Make sure you send those in.
Details are on the website and on Piazza.
11
00:01:09,074 --> 00:01:15,269
Also a reminder, assignment two is already
out. That'll be due a week from Thursday.
12
00:01:15,269 --> 00:01:25,860
Historically, assignment two has been the longest one in the class, so if you haven't
started already on assignment two, I'd recommend you take a look at that pretty soon.
13
00:01:27,122 --> 00:01:32,484
Another reminder is that for assignment two, I
think a lot of you will be using Google Cloud.
14
00:01:32,484 --> 00:01:38,586
Big reminder, make sure to stop your instances when you're not
using them because whenever your instance is on, you get charged,
15
00:01:38,586 --> 00:01:42,899
and we only have so many coupons
to distribute to you guys.
16
00:01:42,899 --> 00:01:52,223
Any time your instance is on, even if you're not SSH'd into it, even if you're not running things
in your Jupyter Notebook, you're going to be charged.
17
00:01:52,223 --> 00:01:57,118
Just make sure that you explicitly stop
your instances when you're not using them.
18
00:01:57,118 --> 00:02:04,970
In this example, I've got a little screenshot of my dashboard on Google Cloud.
I need to go in there and explicitly go to the dropdown and click stop.
19
00:02:04,970 --> 00:02:08,644
Just make sure that you do this when
you're done working each day.
20
00:02:09,481 --> 00:02:20,853
Another thing to remember is it's kind of up to you guys to keep track of your spending on Google
Cloud. In particular, instances that use GPUs are a lot more expensive than those with CPUs.
21
00:02:20,853 --> 00:02:28,322
Rough order of magnitude, those GPU instances are around 90
cents to a dollar an hour. Those are actually quite pricey.
22
00:02:28,322 --> 00:02:39,739
The CPU instances are much cheaper. The general strategy is that you probably want to make two instances,
one with a GPU and one without, and then only use that GPU instance when you really need the GPU.
23
00:02:39,739 --> 00:02:47,377
For example, on assignment two, most of the assignment, you should
only need the CPU, so you should only use your CPU instance for that.
24
00:02:47,377 --> 00:02:52,990
But then for the final question, about
TensorFlow or PyTorch, you will need a GPU.
25
00:02:52,990 --> 00:02:58,897
This'll give you a little bit of practice with switching between
multiple instances and only using that GPU when it's really necessary.
26
00:02:58,897 --> 00:03:04,307
Again, just kind of watch your spending.
Try not to go too crazy on these things.
27
00:03:04,307 --> 00:03:07,748
Any questions on the administrative stuff
before we move on?
28
00:03:11,180 --> 00:03:12,182
Question.
29
00:03:12,182 --> 00:03:13,902
- [Student] How much RAM should we use?
30
00:03:13,902 --> 00:03:16,133
- Question is how much RAM should we use?
31
00:03:16,133 --> 00:03:21,863
I think eight or 16 gigs is probably good
for everything that you need in this class.
32
00:03:21,863 --> 00:03:27,114
As you scale up the number of CPUs and the amount
of RAM, you also end up spending more money.
33
00:03:27,114 --> 00:03:34,542
If you stick with two or four CPUs and eight or 16 gigs of RAM, that
should be plenty for all the homework-related stuff that you need to do.
34
00:03:36,636 --> 00:03:40,417
As a quick recap, last time we
talked about activation functions.
35
00:03:40,417 --> 00:03:44,962
We talked about this whole zoo of different activation
functions and some of their different properties.
36
00:03:44,962 --> 00:03:59,736
We saw that the sigmoid, which used to be quite popular when training neural networks maybe 10 years ago or so, has this
problem with vanishing gradients near the two ends of the activation function. tanh has this similar sort of problem.
37
00:03:59,736 --> 00:04:09,230
Kind of the general recommendation is that you probably want to stick with ReLU for most cases
as sort of a default choice 'cause it tends to work well for a lot of different architectures.
38
00:04:09,230 --> 00:04:16,820
We also talked about weight initialization. Remember that up on
the top, we have this idea that when you initialize your weights
39
00:04:16,820 --> 00:04:23,787
at the start of training, if those weights are initialized to be
too small, then the activations will vanish
40
00:04:23,788 --> 00:04:29,583
as you go through the network because as you multiply by these small
numbers over and over again, they'll all sort of decay to zero.
41
00:04:29,583 --> 00:04:33,072
Then everything will be zero,
learning won't happen, you'll be sad.
42
00:04:33,072 --> 00:04:41,208
On the other hand, if you initialize your weights too big, then as you go through the
network and multiply by your weight matrix over and over again, eventually they'll explode.
43
00:04:41,208 --> 00:04:45,389
You'll be unhappy, there'll be no
learning, it will be very bad.
44
00:04:45,389 --> 00:04:58,531
But if you get that initialization just right, for example, using the Xavier initialization or the MSRA
initialization, then you kind of keep a nice distribution of activations as you go through the network.
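As a minimal numpy sketch of the two initialization schemes mentioned here (the layer sizes are made up for illustration):

```python
import numpy as np

fan_in, fan_out = 512, 512  # hypothetical layer sizes

# Xavier initialization: scale by 1/sqrt(fan_in), which roughly preserves
# the variance of activations through a tanh-like layer.
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# MSRA (He) initialization: an extra factor of 2 inside the square root
# to compensate for ReLU zeroing out half of the activations.
W_msra = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```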
45
00:04:58,531 --> 00:05:04,328
Remember that this kind of gets more and more important and
more and more critical as your networks get deeper and deeper
46
00:05:04,328 --> 00:05:11,620
because as your network gets deeper, you're multiplying by those weight
matrices over and over again with these more multiplicative terms.
47
00:05:11,620 --> 00:05:23,666
We also talked last time about data preprocessing. We talked about how it's pretty typical
in conv nets to zero center and normalize your data so it has zero mean and unit variance.
48
00:05:23,666 --> 00:05:29,968
I wanted to provide a little bit of extra intuition
about why you might actually want to do this.
49
00:05:29,968 --> 00:05:39,532
Imagine a simple setup where we have a binary classification problem where
we want to draw a line to separate these red points from these blue points.
50
00:05:39,532 --> 00:05:46,948
On the left, you have this idea where if those data points are kind
of not normalized and not centered and far away from the origin,
51
00:05:46,948 --> 00:05:55,007
then we can still use a line to separate them, but now if that line wiggles
just a little bit, then our classification is going to get totally destroyed.
52
00:05:55,007 --> 00:06:05,992
That kind of means that in the example on the left, the loss function is now extremely
sensitive to small perturbations in that linear classifier in our weight matrix.
53
00:06:07,315 --> 00:06:14,554
We can still represent the same functions, but that might make
learning quite difficult because, again, the loss is very sensitive
54
00:06:14,554 --> 00:06:25,351
to our parameter vector, whereas in the situation on the right, if you take that data cloud
and you move it into the origin and you make it unit variance, then now, again, we can still
55
00:06:25,351 --> 00:06:35,523
classify that data quite well, but now as we wiggle that line a little bit, then our
loss function is less sensitive to small perturbations in the parameter values.
56
00:06:35,523 --> 00:06:41,064
That maybe makes optimization a little bit
easier, as we'll see a little bit going forward.
57
00:06:41,064 --> 00:06:46,539
By the way, this situation is not only
in the linear classification case.
58
00:06:46,539 --> 00:06:57,756
Inside a neural network, remember we kind of have these interleavings of these linear
matrix multiplies, or convolutions, followed by non-linear activation functions.
59
00:06:59,078 --> 00:07:05,687
If the input to some layer in your neural network is not
zero-centered or not unit variance, then again,
60
00:07:05,687 --> 00:07:15,632
small perturbations in the weight matrix of that layer of the network could cause large
perturbations in the output of that layer, which, again, might make learning difficult.
61
00:07:15,632 --> 00:07:20,481
This is kind of a little bit of extra intuition
about why normalization might be important.
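As a quick sketch of the zero-center-and-normalize step itself (X is a hypothetical data matrix with one example per row):

```python
import numpy as np

X = np.random.rand(1000, 3072) * 255.0  # hypothetical raw pixel data, N x D

# Zero-center each feature, then scale to unit variance.
X -= X.mean(axis=0)
X /= X.std(axis=0) + 1e-8  # epsilon guards against zero-variance features
```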
62
00:07:21,864 --> 00:07:26,862
Because we have this intuition that normalization is
so important, we talked about batch normalization,
63
00:07:26,862 --> 00:07:36,030
which is where we just add this additional layer inside our networks to just
force all of the intermediate activations to be zero mean and unit variance.
64
00:07:36,030 --> 00:07:41,465
I've sort of resummarized the batch normalization equations
here with the shapes a little bit more explicitly.
65
00:07:41,465 --> 00:07:45,172
Hopefully this can help you out when you're
implementing this thing on assignment two.
66
00:07:45,172 --> 00:07:59,254
But again, in batch normalization, we have this idea that in the forward pass, we use the statistics of the mini batch
to compute a mean and a standard deviation, and then use those estimates to normalize our data on the forward pass.
67
00:07:59,254 --> 00:08:05,641
Then we also reintroduce the scale and shift
parameters to increase the expressivity of the layer.
68
00:08:05,641 --> 00:08:09,990
You might want to refer back to this
when working on assignment two.
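As a rough sketch of the training-time forward pass just described (shapes follow the N x D case; the running-average statistics used at test time are omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) mini batch; gamma, beta: (D,) learned scale and shift.
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # scale and shift restore expressivity
```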
69
00:08:09,990 --> 00:08:18,146
We also talked last time a little bit about babysitting the learning process,
how you should probably be looking at your loss curves during training.
70
00:08:18,146 --> 00:08:26,683
Here's an example of some networks I was actually training over the
weekend. This is usually my setup when I'm working on these things.
71
00:08:26,683 --> 00:08:35,795
On the left, I have some plot showing the training loss over time. You can see it's
kind of going down, which means my network is reducing the loss. It's doing well.
72
00:08:35,795 --> 00:08:48,464
On the right, there's this plot where the X axis is, again, time, or the iteration number,
and the Y axis is my performance measure both on my training set and on my validation set.
73
00:08:48,465 --> 00:08:58,680
You can see that as we go over time, then my training set performance goes up and up and up and up and
up as my loss function goes down, but at some point, my validation set performance kind of plateaus.
74
00:08:58,680 --> 00:09:05,066
This kind of suggests that maybe I'm overfitting in this situation.
Maybe I should have been trying to add additional regularization.
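A sketch of that kind of monitoring setup (the logged values here are invented for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical values recorded during training.
loss_history = [2.3, 1.9, 1.5, 1.2, 1.0, 0.9]
train_acc_history = [0.20, 0.35, 0.50, 0.62, 0.71, 0.78]
val_acc_history = [0.20, 0.34, 0.45, 0.50, 0.52, 0.52]  # plateau suggests overfitting

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(loss_history)
ax1.set(xlabel='iteration', ylabel='training loss')
ax2.plot(train_acc_history, label='train')
ax2.plot(val_acc_history, label='val')
ax2.set(xlabel='epoch', ylabel='accuracy')
ax2.legend()
plt.show()
```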
75
00:09:06,317 --> 00:09:09,504
We also talked a bit last time about
hyperparameter search.
76
00:09:09,504 --> 00:09:14,798
All these networks have sort of a large zoo of
hyperparameters. It's pretty important to set them correctly.
77
00:09:14,798 --> 00:09:20,725
We talked a little bit about grid search versus random search,
and how random search is maybe a little bit nicer in theory
78
00:09:20,725 --> 00:09:30,669
because in the situation where your performance is more sensitive
to one hyperparameter than to another, random search lets you cover that space a little bit better.
79
00:09:30,669 --> 00:09:37,005
We also talked about the idea of coarse to fine search, where
when you're doing this hyperparameter optimization, probably you
80
00:09:37,005 --> 00:09:43,408
want to start with very wide ranges for your hyperparameters,
only train for a couple iterations, and then based on
81
00:09:43,408 --> 00:09:47,973
those results, you kind of narrow in on the
range of hyperparameters that are good.
82
00:09:47,973 --> 00:09:51,666
Now, again, redo your search in a
smaller range for more iterations.
83
00:09:51,666 --> 00:09:56,708
You can kind of iterate this process to kind of
hone in on the right region for hyperparameters.
84
00:09:56,708 --> 00:10:04,455
But again, it's really important to start with a very coarse range,
with very, very wide ranges for all your hyperparameters.
85
00:10:04,455 --> 00:10:13,746
Ideally, those ranges should be so wide that your network is kind of blowing up at either end
of the range so that you know that you've searched a wide enough range for those things.
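A minimal sketch of this coarse-stage random search (the ranges and the train_and_eval helper are hypothetical):

```python
import numpy as np

def sample_hyperparams():
    # Sample on a log scale: learning rate and regularization strength
    # span orders of magnitude, so we draw the exponent uniformly.
    lr = 10 ** np.random.uniform(-6, -1)
    reg = 10 ** np.random.uniform(-5, 0)
    return lr, reg

for _ in range(20):
    lr, reg = sample_hyperparams()
    print(f'lr={lr:.2e}, reg={reg:.2e}')
    # val_acc = train_and_eval(lr, reg, num_iters=500)  # hypothetical trainer
    # Keep the best few settings, then repeat with narrower ranges.
```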
86
00:10:17,462 --> 00:10:18,295
Question?
87
00:10:20,044 --> 00:10:26,672
- [Student] How many [speaks too low to hear]
optimize at once? [speaks too low to hear]
88
00:10:31,840 --> 00:10:34,554
- The question is how many hyperparameters
do we typically search at a time?
89
00:10:34,554 --> 00:10:38,244
Here is two, but there's a lot more
than two in these typical things.
90
00:10:38,244 --> 00:10:45,442
It kind of depends on the exact model and the exact architecture, but because
the number of possibilities is exponential in the number of hyperparameters,
91
00:10:45,442 --> 00:10:48,012
you can't really test too many at a time.
92
00:10:48,012 --> 00:10:51,737
It also kind of depends on how
many machines you have available.
93
00:10:51,737 --> 00:10:55,745
It kind of varies from person to person
and from experiment to experiment.
94
00:10:55,745 --> 00:11:05,353
But generally, I try not to do this over more than maybe two or three or four at
a time at most because, again, this exponential search just gets out of control.
95
00:11:05,353 --> 00:11:10,406
Typically, learning rate is the really
important one that you need to nail first.
96
00:11:10,406 --> 00:11:19,542
Then other things, like regularization, like learning rate decay, model size, these
other types of things tend to be a little bit less sensitive than learning rate.
97
00:11:19,542 --> 00:11:22,723
Sometimes you might do kind of a block
coordinate descent, where you go and find
98
00:11:22,723 --> 00:11:27,459
the good learning rate, then you go back
and try to look at different model sizes.
99
00:11:27,459 --> 00:11:30,759
This can help you cut down on the
exponential search a little bit,
100
00:11:30,759 --> 00:11:35,370
but it's a little bit problem dependent on exactly which
ones you should be searching over in which order.
101
00:11:36,253 --> 00:11:38,120
More questions?
102
00:11:38,120 --> 00:11:57,041
- [Student] [speaks too low to hear] Another parameter, but then changing that other parameter, two or three other
parameters, makes it so that your learning rate or the ideal learning rate is still [speaks too low to hear].
103
00:11:57,041 --> 00:12:04,537
- Question is how often does it happen that when you change one hyperparameter,
the optimal values of the other hyperparameters change?
104
00:12:04,537 --> 00:12:11,339
That does happen sometimes, although for learning
rates, that's typically less of a problem.
105
00:12:11,339 --> 00:12:18,130
For learning rates, typically you want to get in a good range, and then set
it maybe even a little bit lower than optimal, and let it go for a long time.
106
00:12:18,130 --> 00:12:31,291
Then if you do that, combined with some of the fancier optimization strategies that we'll talk about today,
then a lot of models tend to be a little bit less sensitive to learning rate once you get them in a good range.
107
00:12:31,291 --> 00:12:32,962
Sorry, did you have a
question in front, as well?
108
00:12:32,962 --> 00:12:37,308
- [Student] [speaks too low to hear]
109
00:12:37,308 --> 00:12:41,292
- The question is what's wrong with having a small
learning rate and increasing the number of epochs?
110
00:12:41,292 --> 00:12:45,139
The answer is that it might take
a very long time. [laughs]
111
00:12:45,139 --> 00:12:48,383
- [Student] [speaks too low to hear]
112
00:12:48,383 --> 00:12:54,853
- Intuitively, if you set the learning rate very low and let it go
for a very long time, then this should, in theory, always work.
113
00:12:54,853 --> 00:13:00,491
But in practice, those factors of 10 or 100 actually
matter a lot when you're training these things.
114
00:13:00,491 --> 00:13:03,931
Maybe if you got the right learning rate,
you could train it in six hours, 12 hours
115
00:13:03,931 --> 00:13:11,911
or a day, but then if you just were super safe and dropped it by a factor of 10
or by a factor of 100, now that one-day training becomes 100 days of training.
116
00:13:11,911 --> 00:13:16,400
That's three months.
That's not going to be good.
117
00:13:16,400 --> 00:13:20,668
When you're taking these intro computer science classes, they
always kind of sweep the constants under the rug, but when
118
00:13:20,668 --> 00:13:25,444
you're actually thinking about training things,
those constants end up mattering a lot.
119
00:13:25,444 --> 00:13:26,861
Another question?
120
00:13:27,877 --> 00:13:33,385
- [Student] If you have a low learning
rate, [speaks too low to hear].
121
00:13:33,385 --> 00:13:37,807
- Question is for a low learning rate, are
you more likely to be stuck in local optima?
122
00:13:37,807 --> 00:13:42,601
I think that makes some intuitive sense, but in
practice, that seems not to be much of a problem.
123
00:13:42,601 --> 00:13:47,030
I think we'll talk a bit
more about that later today.
124
00:13:47,030 --> 00:13:53,151
Today I wanted to talk about a couple other really interesting
and important topics when we're training neural networks.
125
00:13:53,151 --> 00:13:59,655
In particular, we've kind of alluded to this idea
of fancier, more powerful optimization algorithms a couple of times.
126
00:13:59,655 --> 00:14:07,067
I wanted to spend some time today and really dig into those and talk about what
are the actual optimization algorithms that most people are using these days.
127
00:14:07,067 --> 00:14:10,364
We also touched on regularization
in earlier lectures.
128
00:14:10,364 --> 00:14:15,806
This concept of making your network do additional
things to reduce the gap between train and test error.
129
00:14:15,806 --> 00:14:22,143
I wanted to talk about some more strategies that people are using
in practice of regularization, with respect to neural networks.
130
00:14:22,143 --> 00:14:26,401
Finally, I also wanted to talk a bit
about transfer learning, where you can
131
00:14:26,401 --> 00:14:31,490
sometimes get away with using less data than you
think by transferring from one problem to another.
132
00:14:32,821 --> 00:14:39,885
If you recall from a few lectures ago, the kind of core
strategy in training neural networks is an optimization problem
133
00:14:39,885 --> 00:14:50,982
where we write down some loss function, which defines, for each value of the network weights,
the loss function tells us how good or bad is that value of the weights doing on our problem.
134
00:14:50,982 --> 00:14:56,508
Then we imagine that this loss function gives
us some nice landscape over the weights,
135
00:14:56,508 --> 00:15:04,142
where on the right, I've shown this maybe small, two-dimensional
problem, where the X and Y axes are two values of the weights.
136
00:15:04,142 --> 00:15:07,984
Then the color of the plot kind of
represents the value of the loss.
137
00:15:07,984 --> 00:15:15,195
In this kind of cartoon picture of a two-dimensional problem,
we're only optimizing over these two values, W one, W two.
138
00:15:15,195 --> 00:15:23,203
The goal is to find the most red region in this case, which
corresponds to the setting of the weights with the lowest loss.
139
00:15:23,203 --> 00:15:29,099
Remember, we've been working so far with this extremely
simple optimization algorithm, stochastic gradient descent,
140
00:15:29,099 --> 00:15:32,393
where it's super simple, it's three lines.
141
00:15:32,393 --> 00:15:39,179
While true, we first evaluate the loss and
the gradient on some mini batch of data.
142
00:15:39,179 --> 00:15:44,656
Then we step, updating our parameter vector
in the negative direction of the gradient
143
00:15:44,656 --> 00:15:48,798
because this gives, again, the direction
of greatest decrease of the loss function.
144
00:15:48,798 --> 00:15:56,282
Then we repeat this over and over again, and hopefully we converge
to the red region and we get great errors and we're very happy.
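As a self-contained toy version of that loop (a least-squares problem with synthetic data, standing in for a real network):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))      # synthetic inputs
w_true = rng.normal(size=10)
y = X @ w_true                       # synthetic targets

w = np.zeros(10)                     # initial weights
learning_rate = 1e-2
for step in range(500):
    idx = rng.integers(0, len(X), size=64)      # sample a mini batch
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / len(idx)  # gradient of mean squared error
    w -= learning_rate * grad                   # step opposite the gradient
```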
145
00:15:56,282 --> 00:16:05,462
But unfortunately, this relatively simple optimization algorithm has
quite a lot of problems that actually could come up in practice.
146
00:16:05,462 --> 00:16:08,713
One problem with stochastic
gradient descent,
147
00:16:08,713 --> 00:16:18,969
imagine what happens if our objective function looks something like
this, where, again, we're plotting two values, W one and W two.
148
00:16:18,969 --> 00:16:23,472
As we change one of those values,
the loss function changes very slowly.
149
00:16:23,472 --> 00:16:26,687
As we change the horizontal value,
then our loss changes slowly.
150
00:16:28,152 --> 00:16:34,930
As we go up and down in this landscape, now our loss is
very sensitive to changes in the vertical direction.
151
00:16:34,930 --> 00:16:40,757
By the way, this is referred to as the loss
having a bad condition number at this point,
152
00:16:40,757 --> 00:16:46,050
which is the ratio between the largest and smallest
singular values of the Hessian matrix at that point.
153
00:16:46,050 --> 00:16:50,497
But the intuitive idea is that the loss
landscape kind of looks like a taco shell.
154
00:16:50,497 --> 00:16:54,393
It's sort of very sensitive in one direction,
not sensitive in the other direction.
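To make that definition concrete, a tiny sketch for a quadratic loss whose Hessian we choose by hand:

```python
import numpy as np

# For the quadratic loss f(w) = 0.5 * w^T H w, the Hessian is just H.
# A "taco shell" loss: steep in one direction, nearly flat in the other.
H = np.diag([100.0, 1.0])

s = np.linalg.svd(H, compute_uv=False)  # singular values of the Hessian
print(s.max() / s.min())                # condition number: 100.0
```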
155
00:16:54,393 --> 00:17:00,633
The question is what might SGD, stochastic gradient
descent, do on a function that looks like this?
156
00:17:05,310 --> 00:17:12,196
If you run stochastic gradient descent on this type of function,
you might get this characteristic zigzagging behavior,
157
00:17:12,197 --> 00:17:22,111
where because for this type of objective function, the direction of
the gradient does not align with the direction towards the minima.
158
00:17:22,112 --> 00:17:29,335
When you compute the gradient and take a step, you might step
sort of over this line and sort of zigzag back and forth.
159
00:17:29,335 --> 00:17:35,995
In effect, you get very slow progress along the horizontal
dimension, which is the less sensitive dimension, and you get this
160
00:17:35,995 --> 00:17:41,551
zigzagging, nasty, nasty zigzagging behavior
across the fast-changing dimension.
161
00:17:41,551 --> 00:17:50,139
This is undesirable behavior. By the way, this problem
actually becomes much more common in high dimensions.
162
00:17:51,186 --> 00:18:00,617
In this kind of cartoon picture, we're only showing a two-dimensional optimization landscape, but in
practice, our neural networks might have millions, tens of millions, hundreds of millions of parameters.
163
00:18:00,617 --> 00:18:14,221
That's hundreds of millions of directions along which this thing can move. Now among those hundreds of millions of different
directions to move, if the ratio between the largest one and the smallest one is bad, then SGD will not perform so nicely.
164
00:18:14,221 --> 00:18:20,573
You can imagine that if we have 100 million parameters, probably
the maximum ratio between those two will be quite large.
165
00:18:20,573 --> 00:18:26,398
I think this is actually quite a big problem in
practice for many high-dimensional problems.
166
00:18:27,793 --> 00:18:33,564
Another problem with SGD has to do with
this idea of local minima or saddle points.
167
00:18:33,564 --> 00:18:44,003
Here I've sort of swapped the graph a little bit. Now the X axis is showing the
value of one parameter, and then the Y axis is showing the value of the loss.
168
00:18:44,003 --> 00:18:51,583
In this top example, we have kind of this curvy objective
function, where there's a valley in the middle.
169
00:18:51,583 --> 00:18:55,036
What happens to SGD in this situation?
170
00:18:55,036 --> 00:18:57,031
- [Student] [speaks too low to hear]
171
00:18:57,031 --> 00:19:04,454
- In this situation, SGD will get stuck because at this local
minima, the gradient is zero because it's locally flat.
172
00:19:04,454 --> 00:19:09,194
Now remember with SGD, we compute the gradient
and step in the direction of opposite gradient,
173
00:19:09,194 --> 00:19:15,862
so if at our current point, the opposite gradient is zero, then we're
not going to make any progress, and we'll get stuck at this point.
174
00:19:15,862 --> 00:19:19,406
There's another problem with this idea
of saddle points.
175
00:19:19,406 --> 00:19:26,140
Rather than being a local minima, you can imagine a point where
in one direction we go up, and in the other direction we go down.
176
00:19:26,140 --> 00:19:28,953
Then at our current point,
the gradient is zero.
177
00:19:28,953 --> 00:19:35,899
Again, in this situation, the function will get stuck
at the saddle point because the gradient is zero.
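A concrete example: for f(x, y) = x² − y², the origin goes up along x, down along y, and the gradient there is exactly zero:

```python
import numpy as np

def grad_f(w):
    # Gradient of f(x, y) = x**2 - y**2.
    x, y = w
    return np.array([2 * x, -2 * y])

print(grad_f(np.array([0.0, 0.0])))  # [ 0. -0.] -- gradient descent is stuck
```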
178
00:19:35,899 --> 00:19:48,122
Although one thing I'd like to point out is that in one dimension, in a one-dimensional problem like this, local
minima seem like a big problem and saddle points seem like kind of not something to worry about, but in fact,
179
00:19:48,122 --> 00:19:57,171
it's the opposite once you move to very high-dimensional problems because, again, if you
think about you're in this 100 million dimensional space, what does a saddle point mean?
180
00:19:57,171 --> 00:20:03,135
That means that at my current point, some directions the
loss goes up, and some directions the loss goes down.
181
00:20:03,135 --> 00:20:09,591
If you have 100 million dimensions, that's
probably going to happen almost everywhere, basically.
182
00:20:09,591 --> 00:20:16,744
Whereas a local minima says that of all those 100 million directions
that I can move, every one of them causes the loss to go up.
183
00:20:16,744 --> 00:20:22,316
In fact, that seems pretty rare when you're thinking
about, again, these very high-dimensional problems.
184
00:20:23,270 --> 00:20:33,283
Really, the idea that has come to light in the last few years is that when you're training these
very large neural networks, the problem is more about saddle points and less about local minima.
185
00:20:33,283 --> 00:20:40,140
By the way, this also is a problem not just exactly
at the saddle point, but also near the saddle point.
186
00:20:40,140 --> 00:20:47,935
If you look at the example on the bottom, you see that in the regions around
the saddle point, the gradient isn't zero, but the slope is very small.
187
00:20:47,935 --> 00:20:53,611
That means that if we're, again, just stepping in the direction of
the gradient, and that gradient is very small, we're going to make
188
00:20:53,611 --> 00:21:01,872
very, very slow progress whenever our current parameter
value is near a saddle point in the objective landscape.
189
00:21:01,872 --> 00:21:10,115
This is actually a big problem.
Another problem with SGD comes from the S.
190
00:21:10,115 --> 00:21:13,521
Remember that SGD is
stochastic gradient descent.
191
00:21:13,521 --> 00:21:20,586
Recall that our loss function is typically defined by
computing the loss over many, many different examples.
192
00:21:20,586 --> 00:21:26,119
In this case, if N is your whole training set,
then that could be something like a million.
193
00:21:26,119 --> 00:21:29,347
Each time computing the loss
would be very, very expensive.
194
00:21:29,347 --> 00:21:36,957
In practice, remember that we often estimate the loss and
estimate the gradient using a small mini batch of examples.
195
00:21:36,957 --> 00:21:42,148
What this means is that we're not actually getting the
true information about the gradient at every time step.
196
00:21:42,148 --> 00:21:46,773
Instead, we're just getting some noisy
estimate of the gradient at our current point.
197
00:21:46,773 --> 00:21:50,575
Here on the right, I've kind of faked
this plot a little bit.
198
00:21:50,575 --> 00:21:59,927
I've just added random uniform noise to the gradient at every
point, and then run SGD with these noisy, messed up gradients.
199
00:21:59,927 --> 00:22:07,987
This is maybe not exactly what happens with the SGD process, but it still gives
you the sense that if there's noise in your gradient estimates, then vanilla SGD
200
00:22:07,987 --> 00:22:14,036
kind of meanders around the space and might actually
take a long time to get towards the minima.
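A sketch of the faked experiment described here (a simple quadratic bowl, with made-up uniform noise added to each gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([10.0, 10.0])
learning_rate = 0.1

for step in range(100):
    grad = 2 * w                            # true gradient of f(w) = ||w||^2
    noise = rng.uniform(-5.0, 5.0, size=2)  # random uniform noise, as in the plot
    w -= learning_rate * (grad + noise)     # step on the noisy estimate
# The trajectory meanders toward the minimum at the origin
# rather than heading straight there.
```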
201
00:22:15,723 --> 00:22:18,966
Now that we've talked about a lot
of these problems.
202
00:22:18,966 --> 00:22:20,956
Sorry, was there a question?
203
00:22:20,956 --> 00:22:25,123
- [Student] [speaks too low to hear]