Lecture 1 _ Introduction to Convolutional Neural Networks for Visual Recognition.srt
1
00:00:07,641 --> 00:00:10,308
- So welcome everyone to CS231n.
2
00:00:11,762 --> 00:00:14,235
I'm super excited to
offer this class again
3
00:00:14,235 --> 00:00:15,507
for the third time.
4
00:00:15,507 --> 00:00:17,568
It seems that every
time we offer this class
5
00:00:17,568 --> 00:00:21,523
it's growing exponentially
unlike most things in the world.
6
00:00:21,523 --> 00:00:24,434
This is the third time
we're teaching this class.
7
00:00:24,434 --> 00:00:26,466
The first time we had 150 students.
8
00:00:26,466 --> 00:00:29,000
Last year, we had 350
students, so it doubled.
9
00:00:29,000 --> 00:00:32,852
This year we've doubled
again to about 730 students
10
00:00:32,852 --> 00:00:34,806
when I checked this morning.
11
00:00:34,806 --> 00:00:38,428
So anyone who was not able
to fit into the lecture hall
12
00:00:38,428 --> 00:00:40,094
I apologize.
13
00:00:40,094 --> 00:00:43,189
But, the videos will be
up on the SCPD website
14
00:00:43,189 --> 00:00:44,931
within about two hours.
15
00:00:44,931 --> 00:00:46,900
So if you weren't able to come today,
16
00:00:46,900 --> 00:00:50,889
then you can still check it
out within a couple hours.
17
00:00:50,889 --> 00:00:55,076
So this class CS231n is
really about computer vision.
18
00:00:55,076 --> 00:00:57,412
And, what is computer vision?
19
00:00:57,412 --> 00:01:00,141
Computer vision is really
the study of visual data.
20
00:01:00,141 --> 00:01:02,578
Since there's so many people
enrolled in this class,
21
00:01:02,578 --> 00:01:04,522
I think I probably don't
need to convince you
22
00:01:04,522 --> 00:01:06,219
that this is an important problem,
23
00:01:06,219 --> 00:01:10,032
but I'm still going to
try to do that anyway.
24
00:01:10,032 --> 00:01:11,895
The amount of visual data in our world
25
00:01:11,895 --> 00:01:14,173
has really exploded to a ridiculous degree
26
00:01:14,173 --> 00:01:15,761
in the last couple of years.
27
00:01:15,761 --> 00:01:17,613
And, this is largely a
result of the large number
28
00:01:17,613 --> 00:01:20,398
of sensors in the world.
29
00:01:20,398 --> 00:01:21,759
Probably most of us in this room
30
00:01:21,759 --> 00:01:23,064
are carrying around smartphones,
31
00:01:23,064 --> 00:01:25,004
and each smartphone has one, two,
32
00:01:25,004 --> 00:01:26,989
or maybe even three cameras on it.
33
00:01:26,989 --> 00:01:28,974
So I think on average
there's even more cameras
34
00:01:28,974 --> 00:01:31,114
in the world than there are people.
35
00:01:31,114 --> 00:01:32,765
And, as a result of all of these sensors,
36
00:01:32,765 --> 00:01:35,371
there's just a crazy large, massive amount
37
00:01:35,371 --> 00:01:37,524
of visual data being produced
out there in the world
38
00:01:37,524 --> 00:01:38,508
each day.
39
00:01:38,508 --> 00:01:41,239
So one statistic that I
really like to kind of put
40
00:01:41,239 --> 00:01:43,858
this in perspective is a 2015 study
41
00:01:43,858 --> 00:01:47,025
from Cisco that estimated that by 2017,
42
00:01:48,919 --> 00:01:51,784
which is where we are now, roughly 80%
43
00:01:51,784 --> 00:01:54,484
of all traffic on the
internet would be video.
44
00:01:54,484 --> 00:01:58,074
This is not even counting all the images
45
00:01:58,074 --> 00:02:00,525
and other types of visual data on the web.
46
00:02:00,525 --> 00:02:03,880
But, just from a pure
number of bits perspective,
47
00:02:03,880 --> 00:02:06,002
the majority of bits
flying around the internet
48
00:02:06,002 --> 00:02:07,476
are actually visual data.
49
00:02:07,476 --> 00:02:09,547
So it's really critical
that we develop algorithms
50
00:02:09,547 --> 00:02:13,157
that can utilize and understand this data.
51
00:02:13,157 --> 00:02:15,370
However, there's a
problem with visual data,
52
00:02:15,370 --> 00:02:17,813
and that's that it's
really hard to understand.
53
00:02:17,813 --> 00:02:20,813
Sometimes we call visual
data the dark matter
54
00:02:20,813 --> 00:02:24,526
of the internet in analogy
with dark matter in physics.
55
00:02:24,526 --> 00:02:27,437
So for those of you who have
heard of this in physics
56
00:02:27,437 --> 00:02:31,180
before, dark matter accounts
for some astonishingly large
57
00:02:31,180 --> 00:02:33,377
fraction of the mass in the universe,
58
00:02:33,377 --> 00:02:35,167
and we know about it due to the existence
59
00:02:35,167 --> 00:02:38,293
of gravitational pulls on
various celestial bodies
60
00:02:38,293 --> 00:02:40,535
and what not, but we
can't directly observe it.
61
00:02:40,535 --> 00:02:42,838
And, visual data on the
internet is much the same
62
00:02:42,838 --> 00:02:45,488
where it comprises the majority of bits
63
00:02:45,488 --> 00:02:49,164
flying around the internet,
but it's very difficult
64
00:02:49,164 --> 00:02:51,313
for algorithms to actually
go in and understand
65
00:02:51,313 --> 00:02:54,222
and see what exactly is
comprising all the visual data
66
00:02:54,222 --> 00:02:55,685
on the web.
67
00:02:55,685 --> 00:02:58,466
Another statistic that I
like is that of YouTube.
68
00:02:58,466 --> 00:03:02,309
So roughly every second of clock time
69
00:03:02,309 --> 00:03:05,303
that happens in the world,
there's something like five hours
70
00:03:05,303 --> 00:03:07,746
of video being uploaded to YouTube.
71
00:03:07,746 --> 00:03:09,305
So if we just sit here and count,
72
00:03:09,305 --> 00:03:12,805
one, two, three, now there's 15 more hours
73
00:03:13,929 --> 00:03:15,596
of video on YouTube.
74
00:03:17,076 --> 00:03:18,824
Google has a lot of
employees, but there's no way
75
00:03:18,824 --> 00:03:21,219
that they could ever
have an employee sit down
76
00:03:21,219 --> 00:03:24,146
and watch and understand
and annotate every video.
77
00:03:24,146 --> 00:03:26,856
So if they want to catalog and serve you
78
00:03:26,856 --> 00:03:29,361
relevant videos and maybe
monetize by putting ads
79
00:03:29,361 --> 00:03:32,057
on those videos, it's really
crucial that we develop
80
00:03:32,057 --> 00:03:34,803
technologies that can dive in
and automatically understand
81
00:03:34,803 --> 00:03:37,053
the content of visual data.
82
00:03:38,649 --> 00:03:41,379
So this field of computer vision is
83
00:03:41,379 --> 00:03:44,089
truly an interdisciplinary
field, and it touches
84
00:03:44,089 --> 00:03:45,864
on many different areas of science
85
00:03:45,864 --> 00:03:47,564
and engineering and technology.
86
00:03:47,564 --> 00:03:50,822
So obviously, computer vision's
the center of the universe,
87
00:03:50,822 --> 00:03:53,914
but sort of as a constellation of fields
88
00:03:53,914 --> 00:03:56,453
around computer vision, we
touch on areas like physics
89
00:03:56,453 --> 00:03:59,418
because we need to understand
optics and image formation
90
00:03:59,418 --> 00:04:01,784
and how images are
actually physically formed.
91
00:04:01,784 --> 00:04:03,995
We need to understand
biology and psychology
92
00:04:03,995 --> 00:04:07,879
to understand how animal
brains physically see
93
00:04:07,879 --> 00:04:09,894
and process visual information.
94
00:04:09,894 --> 00:04:12,045
We of course draw a lot
on computer science,
95
00:04:12,045 --> 00:04:14,305
mathematics, and engineering
as we actually strive
96
00:04:14,305 --> 00:04:16,954
to build computer systems that implement
97
00:04:16,954 --> 00:04:19,639
our computer vision algorithms.
98
00:04:19,640 --> 00:04:22,595
So a little bit more about
where I'm coming from
99
00:04:22,595 --> 00:04:24,985
and about where the teaching
staff of this course
100
00:04:24,985 --> 00:04:25,992
is coming from.
101
00:04:25,992 --> 00:04:30,722
Me and my co-instructor
Serena are both PhD students
102
00:04:30,722 --> 00:04:33,606
in the Stanford Vision Lab which is headed
103
00:04:33,606 --> 00:04:37,184
by Professor Fei-Fei Li,
and our lab really focuses
104
00:04:37,184 --> 00:04:39,940
on machine learning and
the computer science side
105
00:04:39,940 --> 00:04:41,184
of things.
106
00:04:41,184 --> 00:04:43,308
I work a little bit more
on language and vision.
107
00:04:43,308 --> 00:04:44,900
I've done some projects in that.
108
00:04:44,900 --> 00:04:46,658
And, other folks in our group have worked
109
00:04:46,658 --> 00:04:48,525
a little bit on the neuroscience
and cognitive science
110
00:04:48,525 --> 00:04:49,775
side of things.
111
00:04:52,541 --> 00:04:54,404
So as a bit of introduction,
you might be curious
112
00:04:54,404 --> 00:04:57,557
about how this course relates
to other courses at Stanford.
113
00:04:57,557 --> 00:05:01,408
So we kind of assume a basic
introductory understanding
114
00:05:01,408 --> 00:05:02,848
of computer vision.
115
00:05:02,848 --> 00:05:04,787
So if you're kind of an undergrad,
116
00:05:04,787 --> 00:05:06,926
and you've never seen
computer vision before,
117
00:05:06,926 --> 00:05:09,698
maybe you should've taken
CS131 which was offered
118
00:05:09,698 --> 00:05:14,229
earlier this year by Fei-Fei
and Juan Carlos Niebles.
119
00:05:14,229 --> 00:05:17,361
There was a course taught last quarter
120
00:05:17,361 --> 00:05:20,836
by Professor Chris
Manning and Richard Socher
121
00:05:20,836 --> 00:05:22,705
about the intersection of deep learning
122
00:05:22,705 --> 00:05:24,925
and natural language processing.
123
00:05:24,925 --> 00:05:27,512
And, I imagine a number of
you may have taken that course
124
00:05:27,512 --> 00:05:28,595
last quarter.
125
00:05:31,482 --> 00:05:33,785
There'll be some overlap
between this course and that,
126
00:05:33,785 --> 00:05:35,769
but we're really focusing
on the computer vision
127
00:05:35,769 --> 00:05:38,861
side of things, and really
focusing all of our motivation
128
00:05:38,861 --> 00:05:40,444
in computer vision.
129
00:05:41,361 --> 00:05:43,078
Also concurrently taught this quarter
130
00:05:43,078 --> 00:05:47,378
is CS231a taught by
Professor Silvio Savarese.
131
00:05:47,378 --> 00:05:52,306
And, CS231a is
a more all-encompassing
132
00:05:52,306 --> 00:05:54,010
computer vision course.
133
00:05:54,010 --> 00:05:57,569
It's focusing on things
like 3D reconstruction,
134
00:05:57,569 --> 00:05:59,896
on matching and robotic vision,
135
00:05:59,896 --> 00:06:01,412
and it's a bit more all-encompassing
136
00:06:01,412 --> 00:06:03,813
with regards to vision than our course.
137
00:06:03,813 --> 00:06:06,647
And, this course, CS231n, really focuses
138
00:06:06,647 --> 00:06:09,358
on a particular class
of algorithms revolving
139
00:06:09,358 --> 00:06:11,922
around neural networks and
especially convolutional
140
00:06:11,922 --> 00:06:13,786
neural networks and their applications
141
00:06:13,786 --> 00:06:16,228
to various visual recognition tasks.
142
00:06:16,228 --> 00:06:17,725
Of course, there's also a number
143
00:06:17,725 --> 00:06:19,178
of seminar courses that are taught,
144
00:06:19,178 --> 00:06:21,154
and you'll have to check the syllabus
145
00:06:21,154 --> 00:06:24,631
and course schedule for
more details on those
146
00:06:24,631 --> 00:06:27,867
'cause they vary a bit each year.
147
00:06:27,867 --> 00:06:29,914
So this lecture is normally given
148
00:06:29,914 --> 00:06:31,672
by Professor Fei-Fei Li.
149
00:06:31,672 --> 00:06:34,174
Unfortunately, she wasn't
able to be here today,
150
00:06:34,174 --> 00:06:36,439
so instead for the majority of the lecture
151
00:06:36,439 --> 00:06:38,463
we're going to tag team a little bit.
152
00:06:38,463 --> 00:06:41,996
She actually recorded a
bit of pre-recorded audio
153
00:06:41,996 --> 00:06:44,772
describing to you the
history of computer vision
154
00:06:44,772 --> 00:06:48,229
because this class is a
computer vision course,
155
00:06:48,229 --> 00:06:50,456
and it's very critical and
important that you understand
156
00:06:50,456 --> 00:06:53,289
the history and the context
of all the existing work
157
00:06:53,289 --> 00:06:55,183
that led us to these developments
158
00:06:55,183 --> 00:06:58,000
of convolutional neural
networks as we know them today.
159
00:06:58,500 --> 00:07:00,000
I'll let virtual Fei-Fei take over
160
00:07:00,398 --> 00:07:01,915
[laughing]
161
00:07:01,915 --> 00:07:03,800
and give you a brief
introduction to the history
162
00:07:04,000 --> 00:07:05,500
of computer vision.
163
00:07:08,610 --> 00:07:15,309
Okay, let's start with today's agenda.
So we have two topics to cover. One is a
164
00:07:15,309 --> 00:07:20,620
brief history of computer vision and the
other one is the overview of our course,
165
00:07:20,620 --> 00:07:28,539
CS231n. So we'll start with a very
brief history of where vision comes
166
00:07:28,540 --> 00:07:36,100
from, when computer vision started, and
where we are today. The
167
00:07:36,100 --> 00:07:44,770
history of vision goes back many, many
years, in fact to about 543 million
168
00:07:44,770 --> 00:07:50,800
years ago. What was life like during that
time? Well, the Earth was mostly water.
169
00:07:50,920 --> 00:07:58,300
There were a few species of animals
floating around in the ocean, and life
170
00:07:58,300 --> 00:08:03,730
was very chill. Animals didn't move around
much. They didn't have eyes or
171
00:08:03,730 --> 00:08:09,640
anything. When food swam by, they grabbed
it; if the food didn't swim by, they
172
00:08:09,640 --> 00:08:17,140
just floated around. But something really
remarkable happened around 540 million
173
00:08:17,140 --> 00:08:25,509
years ago. From fossil studies, zoologists
found that within a very short period of
174
00:08:25,509 --> 00:08:33,820
time, ten million years, the number of
animal species just exploded. It went
175
00:08:33,820 --> 00:08:41,500
from a few of them to hundreds of
thousands, and that was strange. What caused this?
176
00:08:41,500 --> 00:08:47,920
There were many theories, but for many
years it was a mystery. Evolutionary
177
00:08:47,920 --> 00:08:55,540
biologists call this evolution's Big Bang.
A few years ago, an Australian zoologist
178
00:08:55,540 --> 00:09:01,299
called Andrew Parker proposed one of the
most convincing theories. From the study
179
00:09:01,299 --> 00:09:07,030
of fossils,
he discovered that around 540 million years
180
00:09:07,030 --> 00:09:19,310
ago the first animals developed eyes, and
the onset of vision started this
181
00:09:19,310 --> 00:09:26,610
explosive speciation phase. Animals could
suddenly see, and once you can see, life
182
00:09:26,610 --> 00:09:32,580
becomes much more proactive. Some
predators went after prey, and prey
183
00:09:32,580 --> 00:09:39,980
had to escape from predators, so the
evolution or onset of vision started an
184
00:09:39,980 --> 00:09:46,860
evolutionary arms race, and animals had
to evolve quickly in order to survive as
185
00:09:46,860 --> 00:09:54,870
a species. So that was the beginning of
vision in animals. After 540 million
186
00:09:54,870 --> 00:10:01,380
years, vision has developed into the
biggest sensory system of almost all
187
00:10:01,380 --> 00:10:09,660
animals, especially intelligent animals.
In humans, we have almost 50% of the
188
00:10:09,660 --> 00:10:15,450
neurons in our cortex involved in visual
processing. It is the biggest sensory
189
00:10:15,450 --> 00:10:22,590
system that enables us to survive, work,
move around, manipulate things,
190
00:10:22,590 --> 00:10:29,730
communicate, entertain, and many other things.
Vision is really important for
191
00:10:29,730 --> 00:10:38,930
animals and especially intelligent
animals. So that was a quick story of
192
00:10:38,930 --> 00:10:48,329
biological vision. What about humans, the
history of humans making mechanical
193
00:10:48,329 --> 00:10:56,450
vision or cameras? Well, one of the early
cameras that we know of today is from the
194
00:10:56,450 --> 00:11:04,410
1600s, the Renaissance period of time,
the camera obscura, and this is a camera
195
00:11:04,410 --> 00:11:13,730
based on pinhole camera theories. It's
very similar to the
196
00:11:13,730 --> 00:11:21,390
early eyes that animals developed,
with a hole that collects light
197
00:11:21,390 --> 00:11:28,020
and then a plane in the back of the
camera that collects the information and
198
00:11:28,020 --> 00:11:36,560
projects the imagery. So
as cameras evolved, today we have cameras
199
00:11:36,560 --> 00:11:40,910
everywhere. This is one of the most
popular sensors people use, from
200
00:11:40,910 --> 00:11:49,040
smartphones to other sensors. In the
meantime, biologists started
201
00:11:49,040 --> 00:11:56,510
studying the mechanism of vision. One of
the most influential works in both human
202
00:11:56,510 --> 00:12:02,690
and animal vision, which also
inspired computer vision, is the
203
00:12:02,690 --> 00:12:10,850
work done by Hubel and Wiesel in the '50s
and '60s using electrophysiology.
204
00:12:10,850 --> 00:12:18,170
The question they were asking was, "What is the visual processing mechanism like
205
00:12:18,170 --> 00:12:26,600
in primates, in mammals?" So they chose
to study the cat brain, which is more or less
206
00:12:26,600 --> 00:12:32,090
similar to the human brain from a visual
processing point of view. What they did
207
00:12:32,090 --> 00:12:37,490
was stick some electrodes in the back
of the cat brain, which is where the
208
00:12:37,490 --> 00:12:45,830
primary visual cortex area is, and then
look at what stimuli make the neurons
209
00:12:45,830 --> 00:12:52,970
in the primary visual
cortex of the cat brain respond excitedly.
210
00:12:52,970 --> 00:13:00,380
What they learned is that there are many
types of cells in the primary
211
00:13:00,380 --> 00:13:05,630
visual cortex part of the cat brain,
but one of the most important cell types is
212
00:13:05,630 --> 00:13:12,080
the simple cells. They respond to
oriented edges when they move in certain
213
00:13:12,080 --> 00:13:18,410
directions. Of course, there are also more
complex cells, but by and large what they
214
00:13:18,410 --> 00:13:26,060
discovered is that visual processing starts
with simple structures of the visual world,
215
00:13:26,060 --> 00:13:32,210
oriented edges, and as information
moves along the visual processing
216
00:13:32,210 --> 00:13:38,560
pathway, the brain builds up the
complexity of the visual information