Lecture 4 _ Introduction to Neural Networks.srt
1
00:00:06,804 --> 00:00:10,610
[students murmuring]
2
00:00:10,610 --> 00:00:15,375
- Okay, so good afternoon
everyone, let's get started.
3
00:00:15,375 --> 00:00:18,257
So hi, so for those of
you who I haven't met yet,
4
00:00:18,257 --> 00:00:21,470
my name is Serena Yeung and I'm the third
5
00:00:21,470 --> 00:00:24,853
and final instructor for this class,
6
00:00:24,853 --> 00:00:28,307
and I'm also a PhD student
in Fei-Fei's group.
7
00:00:28,307 --> 00:00:31,292
Okay, so today we're going
to talk about backpropagation
8
00:00:31,292 --> 00:00:33,843
and neural networks, and so
now we're really starting
9
00:00:33,843 --> 00:00:37,346
to get to some of the core
material in this class.
10
00:00:37,346 --> 00:00:39,929
Before we begin, let's see, oh.
11
00:00:42,124 --> 00:00:44,166
So a few administrative details,
12
00:00:44,166 --> 00:00:47,602
so assignment one is due
Thursday, April 20th,
13
00:00:47,602 --> 00:00:51,522
so a reminder, we shifted
the date back by a little bit
14
00:00:51,522 --> 00:00:55,105
and it's going to be due
11:59 p.m. on Canvas.
15
00:00:56,722 --> 00:00:59,316
So you should start thinking
about your projects,
16
00:00:59,316 --> 00:01:02,327
there are TA specialties
listed on the Piazza website
17
00:01:02,327 --> 00:01:05,432
so if you have questions
about a specific project topic
18
00:01:05,432 --> 00:01:08,958
you're thinking about, you
can go and try and find
19
00:01:08,958 --> 00:01:12,682
the TAs that might be most relevant.
20
00:01:12,682 --> 00:01:15,910
And then also for Google Cloud,
so all students are going
21
00:01:15,910 --> 00:01:19,130
to get $100 in credits
to use for Google Cloud
22
00:01:19,130 --> 00:01:21,268
for their assignments and project,
23
00:01:21,268 --> 00:01:22,630
so you should be receiving an email
24
00:01:22,630 --> 00:01:24,285
for that this week, I think.
25
00:01:24,285 --> 00:01:25,666
A lot of you may have already, and then
26
00:01:25,666 --> 00:01:28,227
for those of you who haven't,
they're going to come,
27
00:01:28,227 --> 00:01:31,060
should be by the end of this week.
28
00:01:32,544 --> 00:01:35,721
Okay so where we are, so
far we've talked about
29
00:01:35,721 --> 00:01:39,207
how to define a classifier
using a function f,
30
00:01:39,207 --> 00:01:41,669
parameterized by weights
W, and this function f
31
00:01:41,669 --> 00:01:45,836
is going to take data x as input,
and output a vector of scores
32
00:01:48,027 --> 00:01:51,432
for each of the classes
that you want to classify.
33
00:01:51,432 --> 00:01:54,778
And so from here we can also define
34
00:01:54,778 --> 00:01:57,314
a loss function, so for
example, the SVM loss function
35
00:01:57,314 --> 00:02:00,445
that we've talked about
which basically quantifies
36
00:02:00,445 --> 00:02:02,053
how happy or unhappy we are with
37
00:02:02,053 --> 00:02:04,167
the scores that we've produced, right,
38
00:02:04,167 --> 00:02:07,727
and then we can use that to
define a total loss term.
39
00:02:07,727 --> 00:02:10,757
So L here, which is a
combination of this data term,
40
00:02:10,757 --> 00:02:14,617
combined with a regularization
term that expresses
41
00:02:14,617 --> 00:02:16,601
how simple our model is,
and we have a preference
42
00:02:16,601 --> 00:02:20,201
for simpler models, for
better generalization.
43
00:02:20,201 --> 00:02:23,295
And so now we want to
find the parameters W
44
00:02:23,295 --> 00:02:25,209
that correspond to our lowest loss, right?
45
00:02:25,209 --> 00:02:27,092
We want to minimize the loss function,
46
00:02:27,092 --> 00:02:28,497
and so to do that we want to find
47
00:02:28,497 --> 00:02:31,497
the gradient of L with respect to W.
48
00:02:33,275 --> 00:02:35,829
So last lecture we talked
about how we can do this
49
00:02:35,829 --> 00:02:37,735
using optimization, and we're going
50
00:02:37,735 --> 00:02:39,976
to iteratively take steps in the direction
51
00:02:39,976 --> 00:02:42,817
of steepest descent, which is
the negative of the gradient,
52
00:02:42,817 --> 00:02:44,819
in order to walk down this loss landscape
53
00:02:44,819 --> 00:02:47,291
and get to the point
of lowest loss, right?
54
00:02:47,291 --> 00:02:51,947
And we saw how this gradient
descent can basically take
55
00:02:51,947 --> 00:02:54,770
this trajectory, looking
like this image on the right,
56
00:02:54,770 --> 00:02:59,293
getting to the bottom
of your loss landscape.
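In code, that loop looks roughly like the minimal NumPy sketch below; the quadratic loss and the names loss_fn, grad_fn and step_size are toy placeholders, not the actual class code.

import numpy as np

def loss_fn(W):
    return np.sum(W ** 2)            # toy quadratic loss standing in for L(W)

def grad_fn(W):
    return 2 * W                     # its analytic gradient dL/dW

W = 0.01 * np.random.randn(10)       # initialize the weights
step_size = 1e-2                     # learning rate
for t in range(100):
    W -= step_size * grad_fn(W)      # step along the negative gradient (steepest descent)
print(loss_fn(W))                    # loss should have decreased toward zero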
57
00:02:59,293 --> 00:03:00,126
Oh!
58
00:03:02,247 --> 00:03:04,376
Okay, and so we also
talked about different ways
59
00:03:04,376 --> 00:03:06,154
for computing a gradient, right?
60
00:03:06,154 --> 00:03:08,538
We can compute this numerically
61
00:03:08,538 --> 00:03:10,696
using finite difference approximation
62
00:03:10,696 --> 00:03:13,444
which is slow and approximate,
but at the same time
63
00:03:13,444 --> 00:03:14,725
it's really easy to write out,
64
00:03:14,725 --> 00:03:17,589
you know you can always
get the gradient this way.
65
00:03:17,589 --> 00:03:21,064
We also talked about how to
use the analytic gradient
66
00:03:21,064 --> 00:03:23,113
and computing this is, it's fast
67
00:03:23,113 --> 00:03:25,227
and exact once you've
gotten the expression for
68
00:03:25,227 --> 00:03:27,499
the analytic gradient, but
at the same time you have
69
00:03:27,499 --> 00:03:29,670
to do all the math and the
calculus to derive this,
70
00:03:29,670 --> 00:03:32,734
so it's also, you know, easy
to make mistakes, right?
71
00:03:32,734 --> 00:03:34,925
So in practice what we want
to do is we want to derive
72
00:03:34,925 --> 00:03:37,620
the analytic gradient and use this,
73
00:03:37,620 --> 00:03:39,837
but at the same time check
our implementation using
74
00:03:39,837 --> 00:03:41,267
the numerical gradient to make sure
75
00:03:41,267 --> 00:03:44,600
that we've gotten all of our math right.
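As a sketch of that check (with a toy function f standing in for the real loss), you can compare a hand-derived analytic gradient against a centered finite-difference estimate:

import numpy as np

def f(W):
    return np.sum(W ** 3)                # toy scalar function of the weights

def analytic_grad(W):
    return 3 * W ** 2                    # gradient derived by hand

def numerical_grad(f, W, h=1e-5):
    grad = np.zeros_like(W)
    for i in range(W.size):              # finite differences, one dimension at a time
        Wp, Wm = W.copy(), W.copy()
        Wp.flat[i] += h
        Wm.flat[i] -= h
        grad.flat[i] = (f(Wp) - f(Wm)) / (2 * h)
    return grad

W = np.random.randn(5)
print(np.max(np.abs(analytic_grad(W) - numerical_grad(f, W))))   # should be very small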
76
00:03:46,032 --> 00:03:48,518
So today we're going to
talk about how to compute
77
00:03:48,518 --> 00:03:52,261
the analytic gradient for
arbitrarily complex functions,
78
00:03:52,261 --> 00:03:55,824
using a framework that I'm going
to call computational graphs.
79
00:03:55,824 --> 00:03:58,506
And so basically what a
computational graph is,
80
00:03:58,506 --> 00:04:00,555
is that we can use this
kind of graph in order
81
00:04:00,555 --> 00:04:04,010
to represent any function,
where the nodes of the graph
82
00:04:04,010 --> 00:04:06,611
are steps of computation
that we go through.
83
00:04:06,611 --> 00:04:07,952
So for example, in this example,
84
00:04:07,952 --> 00:04:11,306
the linear classifier
that we've talked about,
85
00:04:11,306 --> 00:04:14,737
the inputs here are x and W, right,
86
00:04:14,737 --> 00:04:18,180
and then this multiplication
node represents
87
00:04:18,180 --> 00:04:21,321
the matrix multiplier,
the multiplication of
88
00:04:21,322 --> 00:04:25,038
the parameters W with
our data x that we have,
89
00:04:25,038 --> 00:04:27,478
outputting our vector of scores.
90
00:04:27,478 --> 00:04:29,469
And then we have another
computational node
91
00:04:29,469 --> 00:04:31,948
which represents our hinge loss, right,
92
00:04:31,948 --> 00:04:35,113
computing our data loss term, Li.
93
00:04:35,113 --> 00:04:38,198
And we also have this
regularization term at
94
00:04:38,198 --> 00:04:39,340
the bottom right, so this node
95
00:04:39,340 --> 00:04:42,964
which computes our regularization term,
96
00:04:42,964 --> 00:04:44,877
and then our total loss
here at the end, L,
97
00:04:44,877 --> 00:04:49,044
is the sum of the regularization
term and the data term.
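A rough NumPy sketch of that graph for a single example follows; the shapes, the class index y and the regularization strength reg here are made-up toy values, not the assignment setup.

import numpy as np

np.random.seed(0)
x = np.random.randn(3073)                 # one flattened input vector (toy size)
W = 0.01 * np.random.randn(10, 3073)      # weights for 10 classes
y = 4                                     # index of the correct class (toy value)
reg = 1e-3                                # regularization strength

s = W.dot(x)                              # multiplication node: vector of class scores
margins = np.maximum(0, s - s[y] + 1)     # hinge loss node (SVM loss)
margins[y] = 0
L_i = np.sum(margins)                     # data loss term L_i
R = np.sum(W * W)                         # regularization node R(W)
L = L_i + reg * R                         # total loss: data term plus regularization term
print(L)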
98
00:04:50,624 --> 00:04:52,325
And the advantage is
that once we can express
99
00:04:52,325 --> 00:04:54,675
a function using a computational graph,
100
00:04:54,675 --> 00:04:58,103
then we can use a technique
that we call backpropagation
101
00:04:58,103 --> 00:05:00,261
which is going to recursively
use the chain rule
102
00:05:00,261 --> 00:05:02,227
in order to compute the gradient
103
00:05:02,227 --> 00:05:05,568
with respect to every variable
in the computational graph,
104
00:05:05,568 --> 00:05:09,830
and so we're going to
see how this is done.
105
00:05:09,830 --> 00:05:11,675
And this becomes very
useful when we start working
106
00:05:11,675 --> 00:05:13,413
with really complex functions,
107
00:05:13,413 --> 00:05:15,502
so for example,
convolutional neural networks
108
00:05:15,502 --> 00:05:17,836
that we're going to talk
about later in this class.
109
00:05:17,836 --> 00:05:20,900
We have here the input image at the top,
110
00:05:20,900 --> 00:05:22,364
we have our loss at the bottom,
111
00:05:22,364 --> 00:05:24,023
and the input has to
go through many layers
112
00:05:24,023 --> 00:05:26,269
of transformations in order to get all
113
00:05:26,269 --> 00:05:29,102
the way down to the loss function.
114
00:05:30,896 --> 00:05:33,165
And this can get even
crazier with things like,
115
00:05:33,165 --> 00:05:35,447
the, you know, like a
Neural Turing Machine,
116
00:05:35,447 --> 00:05:37,869
which is another kind
of deep learning model,
117
00:05:37,869 --> 00:05:39,915
and in this case you can see
that the computational graph
118
00:05:39,915 --> 00:05:43,413
for this is really insane, and especially as
119
00:05:43,413 --> 00:05:45,897
we end up, you know,
unrolling this over time.
120
00:05:45,897 --> 00:05:47,562
It's basically completely impractical
121
00:05:47,562 --> 00:05:49,901
if you want to compute the gradients
122
00:05:49,901 --> 00:05:53,234
for any of these intermediate variables.
123
00:05:55,560 --> 00:05:59,139
Okay, so how does backpropagation work?
124
00:05:59,139 --> 00:06:01,838
So we're going to start
off with a simple example,
125
00:06:01,838 --> 00:06:03,672
where again, our goal is
that we have a function.
126
00:06:03,672 --> 00:06:06,228
So in this case, f of x, y, z
127
00:06:06,228 --> 00:06:08,614
equals (x plus y) times z,
128
00:06:08,614 --> 00:06:10,746
and we want to find the
gradients of the output of
129
00:06:10,746 --> 00:06:13,572
the function with respect
to any of the variables.
130
00:06:13,572 --> 00:06:15,365
So the first step, always, is we want
131
00:06:15,365 --> 00:06:17,329
to take our function f, and we want
132
00:06:17,329 --> 00:06:20,623
to represent it using
a computational graph.
133
00:06:20,623 --> 00:06:23,339
Right, so here our computational
graph is on the right,
134
00:06:23,339 --> 00:06:25,957
and you can see that we have our,
135
00:06:25,957 --> 00:06:27,892
first we have the plus node, so x plus y,
136
00:06:27,892 --> 00:06:30,044
and then we have this
multiplication node, right,
137
00:06:30,044 --> 00:06:34,060
for the second computation
that we're doing.
138
00:06:34,060 --> 00:06:36,201
And then, now we're going
to do a forward pass
139
00:06:36,201 --> 00:06:38,732
of this network, so given the values of
140
00:06:38,732 --> 00:06:40,782
the variables that we have, so here,
141
00:06:40,782 --> 00:06:42,944
x equals negative two, y equals five
142
00:06:42,944 --> 00:06:46,251
and z equals negative four,
I'm going to fill these all in
143
00:06:46,251 --> 00:06:49,417
in our computational graph,
and then here we can compute
144
00:06:49,417 --> 00:06:52,965
an intermediate value,
so x plus y gives three,
145
00:06:52,965 --> 00:06:55,045
and then finally we pass it through again,
146
00:06:55,045 --> 00:06:56,990
through the last node, the multiplication,
147
00:06:56,990 --> 00:07:00,823
to get our final value
of f equals negative 12.
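That forward pass, written out node by node in Python:

x, y, z = -2.0, 5.0, -4.0
q = x + y       # plus node: q = 3
f = q * z       # multiplication node: f = -12
print(q, f)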
148
00:07:04,310 --> 00:07:08,784
So here we want to give every
intermediate variable a name.
149
00:07:08,784 --> 00:07:10,438
So here I've called this
intermediate variable after
150
00:07:10,438 --> 00:07:14,997
the plus node q, and we
have q equals x plus y,
151
00:07:14,997 --> 00:07:18,609
and then f equals q times z,
using this intermediate node.
152
00:07:18,609 --> 00:07:21,341
And I've also written
out here, the gradients
153
00:07:21,341 --> 00:07:24,763
of q with respect to x
and y, which are just one
154
00:07:24,763 --> 00:07:28,131
because of the addition,
and then the gradients of f
155
00:07:28,131 --> 00:07:31,653
with respect to q and z,
which are z and q respectively
156
00:07:31,653 --> 00:07:34,607
because of the multiplication rule.
157
00:07:34,607 --> 00:07:36,598
And so what we want to
find, is we want to find
158
00:07:36,598 --> 00:07:40,431
the gradients of f with
respect to x, y and z.
159
00:07:43,356 --> 00:07:46,822
So what backprop is, it's
a recursive application of
160
00:07:46,822 --> 00:07:49,182
the chain rule, so we're
going to start at the back,
161
00:07:49,182 --> 00:07:51,311
the very end of the computational graph,
162
00:07:51,311 --> 00:07:53,360
and then we're going to
work our way backwards
163
00:07:53,360 --> 00:07:56,373
and compute all the
gradients along the way.
164
00:07:56,373 --> 00:07:59,015
So here if we start at
the very end, right,
165
00:07:59,015 --> 00:08:01,217
we want to compute the
gradient of the output
166
00:08:01,217 --> 00:08:04,674
with respect to the last
variable, which is just f.
167
00:08:04,674 --> 00:08:08,591
And so this gradient is
just one, it's trivial.
168
00:08:10,065 --> 00:08:12,604
So now, moving backwards,
we want the gradient
169
00:08:12,604 --> 00:08:15,687
with respect to z, right, and we know
170
00:08:16,637 --> 00:08:19,137
that df over dz is equal to q.
171
00:08:19,993 --> 00:08:22,636
So the value of q is just three,
172
00:08:22,636 --> 00:08:26,386
and so we have here, df
over dz equals three.
173
00:08:28,819 --> 00:08:31,688
And so next if we want to do df over dq,
174
00:08:31,688 --> 00:08:33,855
what is the value of that?
175
00:08:36,732 --> 00:08:37,957
What is df over dq?
176
00:08:37,957 --> 00:08:42,040
So we have here, df over
dq is equal to z, right,
177
00:08:46,012 --> 00:08:49,012
and the value of z is negative four.
178
00:08:49,963 --> 00:08:54,130
So here we have df over dq
is equal to negative four.
179
00:08:57,485 --> 00:09:00,802
Okay, so now continuing to
move backwards through the graph,
180
00:09:00,802 --> 00:09:03,793
we want to find df over dy, right,
181
00:09:03,793 --> 00:09:06,251
but here in this case, the
gradient with respect to y,
182
00:09:06,251 --> 00:09:08,986
y is not connected directly to f, right?
183
00:09:08,986 --> 00:09:12,838
It's connected through an
intermediate node, q,
184
00:09:12,838 --> 00:09:14,756
and so the way we're going to do this
185
00:09:14,756 --> 00:09:17,998
is we can leverage the
chain rule which says
186
00:09:17,998 --> 00:09:21,748
that df over dy can be
written as df over dq,
187
00:09:22,746 --> 00:09:26,754
times dq over dy, and
so the intuition of this
188
00:09:26,754 --> 00:09:30,707
is that in order to get to
find the effect of y on f,
189
00:09:30,707 --> 00:09:33,348
this is actually equivalent to if we take
190
00:09:33,348 --> 00:09:37,416
the effect of q on f,
which we already know, right?
191
00:09:37,416 --> 00:09:41,330
df over dq is equal to negative four,
192
00:09:41,330 --> 00:09:45,497
and we compound it with the
effect of y on q, dq over dy.
193
00:09:46,604 --> 00:09:50,986
So what's dq over dy
equal to in this case?
194
00:09:50,986 --> 00:09:51,819
- [Student] One.
195
00:09:51,819 --> 00:09:52,980
- One, right. Exactly.
196
00:09:52,980 --> 00:09:55,916
So dq over dy is equal to
one, which means, you know,
197
00:09:55,916 --> 00:09:57,984
if we change y by a little bit,
198
00:09:57,984 --> 00:09:59,408
q is going to change by approximately
199
00:09:59,408 --> 00:10:01,627
the same amount right, this is the effect,
200
00:10:01,627 --> 00:10:05,585
and so what this is
doing is this is saying,
201
00:10:05,585 --> 00:10:07,474
well if I change y by a little bit,
202
00:10:07,474 --> 00:10:10,807
the effect of y on q is going to be one,
203
00:10:13,249 --> 00:10:16,882
and then the effect of q on f
is going to be approximately
204
00:10:16,882 --> 00:10:18,628
a factor of negative four, right?
205
00:10:18,628 --> 00:10:20,405
So then we multiply these together
206
00:10:20,405 --> 00:10:24,053
and we get that the effect of y on f
207
00:10:24,053 --> 00:10:26,470
is going to be negative four.
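Putting the whole backward pass for this example together in Python, applying the chain rule from the output back toward the inputs:

x, y, z = -2.0, 5.0, -4.0
q = x + y                  # forward: q = 3
f = q * z                  # forward: f = -12

df_df = 1.0                # gradient of f with respect to itself
df_dz = q * df_df          # df/dz = q             -> 3
df_dq = z * df_df          # df/dq = z             -> -4
df_dy = df_dq * 1.0        # chain rule, dq/dy = 1 -> -4
df_dx = df_dq * 1.0        # chain rule, dq/dx = 1 -> -4
print(df_dx, df_dy, df_dz)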
208
00:10:30,887 --> 00:10:33,249
Okay, so now if we want
to do the same thing for
209
00:10:33,249 --> 00:10:35,632
the gradient with respect to x, right,
210
00:10:35,632 --> 00:10:38,074
we can do the, we can
follow the same procedure,
211
00:10:38,074 --> 00:10:41,191
and so what is this going to be?
212
00:10:41,191 --> 00:10:42,711
[students speaking away from microphone]
213
00:10:42,711 --> 00:10:44,746
- I heard the same.
214
00:10:44,746 --> 00:10:48,746
Yeah exactly, so in this
case we want to, again,
215
00:10:49,636 --> 00:10:51,051
apply the chain rule, right?
216
00:10:51,051 --> 00:10:55,882
We know the effect of q on
f is negative four,
217
00:10:55,882 --> 00:10:59,167
and here again, since we have
also the same addition node,
218
00:10:59,167 --> 00:11:01,958
dq over dx is equal to one, again,
219