---
title: "scrapsheet-DL"
format: html
---
## Deep Learning with R
### Misc
- Current Page: pg 99 3.8.6
- Components
- Low-level tensor manipulation. This translates to TensorFlow APIs:
- *Tensors*, including special tensors that store the network’s state (variables)
- *Tensor operations* such as addition, relu, matmul
- *Backpropagation*, a way to compute the gradient of mathematical expressions (handled in TensorFlow via the GradientTape object)
- High-level deep learning concepts. This translates to Keras APIs:
- *Layers*, which are combined into a model
- A *loss function*, which defines the feedback signal used for learning
- An *optimizer*, which determines how learning proceeds
- *Metrics* to evaluate model performance, such as accuracy
- A *training loop* that performs mini-batch stochastic gradient descent
- Network Architectures
- The topology or architecture of a model defines a *hypothesis space*, a specific series of tensor operations
- Types
- Two-Branch Networks
- Multihead Networks
- Residual Connections
- Loss functions for common tasks
- Binary Classification: Binary Cross-Entropy
- Multinomial Classification: Categorical Cross-Entropy
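As a sketch of what these losses compute, here are minimal base-R versions with made-up example values (real implementations typically also clip predictions away from 0 and 1 for numerical stability):

``` r
# Binary cross-entropy: mean over samples of -[y*log(p) + (1 - y)*log(1 - p)]
binary_crossentropy <- function(y_true, y_pred) {
  -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
}

# Categorical cross-entropy: y_true rows are one-hot; y_pred rows are softmax outputs
categorical_crossentropy <- function(y_true, y_pred) {
  -mean(rowSums(y_true * log(y_pred)))
}

binary_crossentropy(c(1, 0, 1), c(0.9, 0.2, 0.8))
```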
### Terms
- The core building block of neural networks is the **layer**. You can think of a layer as a filter for data: some data goes in, and it comes out in a more useful form. Specifically, layers extract representations out of the data fed into them
- [Example]{.ribbon-highlight}: `layer_dense(units = 512, activation = "relu")`
``` r
# pseudocode for what the layer computes
output <- relu(dot(W, input) + b)
# the relu-of-a-sum computation written as plain R (looped 1000 times, e.g. for timing)
for (i in seq_len(1000)) {
  z <- x + y
  z[z < 0] <- 0
}
```
- A dot product (`dot`) between the input tensor and a tensor named $W$
- An addition (+) between the resulting matrix and a vector $b$
- $W$ and $b$ are weights or trainable parameters of the layer (the kernel and bias attributes, respectively)
- A relu operation: `relu(x)` is an element-wise $\max(x, 0)$; relu stands for rectified linear unit
- Takes as input one or more tensors and that outputs one or more tensors
- Some layers are stateless, but more frequently layers have a state: the layer’s weights, one or several tensors learned with stochastic gradient descent, which together contain the network’s knowledge.
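The `relu(dot(W, input) + b)` computation above can be written out in base R with made-up shapes (note that in the Keras convention the kernel multiplies on the right: `input %*% W`):

``` r
set.seed(1)
input <- matrix(rnorm(4 * 3), nrow = 4)  # (samples = 4, features = 3)
W <- matrix(rnorm(3 * 2), nrow = 3)      # kernel: (features = 3, units = 2)
b <- rnorm(2)                            # bias: one value per unit

relu <- function(x) pmax(x, 0)           # element-wise max(x, 0)

# Dense layer forward pass: add b to each row of input %*% W, then relu
output <- relu(sweep(input %*% W, 2, b, `+`))
dim(output)
#> [1] 4 2
```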
- Types
- Dense Layers - Typically vector data, stored in rank 2 tensors of shape (samples, features), is often processed by densely connected layers. (`layer_dense`)
- Recurrent Layers
- Sequence Data: Stored in rank 3 tensors of shape (samples, timesteps, features)
- Either in a LSTM Layer (`layer_lstm`) or 1D Convolutional Layer (`layer_conv_1d`)
- Image Data: Stored in rank 4 tensors (`layer_conv_2d`)
- Stacking Layers
``` r
layer <- layer_dense(units = 32, activation = "relu")
```
- Says the output's first dimension will be 32
- The next layer in the stack must be shaped to receive an input with 32 as its first dimension (this happens automatically)
- **densely connected** (also called **fully connected**) neural layers
- Run the model on x (a step called the **forward pass**) to obtain predictions
- Each iteration over all the training data is called an **epoch**
- a 10-way **softmax** classification layer, which means it will return an array of 10 probability scores (summing to 1)
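A softmax is just exponentiation followed by normalization; a base-R sketch (subtracting the max is a standard numerical-stability trick and doesn't change the result):

``` r
softmax <- function(x) {
  e <- exp(x - max(x))  # guard against overflow for large scores
  e / sum(e)            # probabilities summing to 1
}

p <- softmax(c(2, 1, 0.5, -1))
sum(p)
#> [1] 1
```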
- Axes
- First Axis is called the **batch axis** or **batch dimension** or **samples axis** or **samples dimension**.
- Second Axis is the features axis for vector data and time steps (or time index) for time series data.
- Data stored in multidimensional (aka axes) arrays, also called **tensors**
- **Broadcasting** - An operation on two different-sized tensors where we want the smaller tensor to match the shape of the larger tensor.
- Steps:
1. Axes (called broadcast axes) are added to the smaller tensor to match the `length(dim(x))` of the larger tensor.
2. The smaller tensor is repeated alongside these new axes to match the full shape of the larger tensor
- [Example]{.ribbon-highlight}: Manual and Automatic
``` r
# random_array <- function(dim, min = 0, max = 1) {
# array(runif(prod(dim), min, max),
# dim)
# }
X <- random_array(c(32, 10))
# vector with 10 elements
y <- random_array(c(10))
y
# shape is now (1, 10)
dim(y) <- c(1, 10)
y
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 0.4076222 0.173166 0.4457272 0.7208785 0.2414163 0.9192088 0.3056461 0.2270679 0.6933124 0.1063865
# repeats 1st row 32 times to get a (32, 10) shape matrix
Y <- y[rep(1, 32), ]
str(Y)
# Automatically happens when tensors of different ranks are added
x <- as_tensor(1, shape = c(64, 3, 32, 10))
y <- as_tensor(2, shape = c(32, 10))
z <- x + y
```
- Makes y's shape the same as X's, so they can be added
- Only a mental model. Broadcasting "happens at the algorithmic level rather than at the memory level."
- In the automatic example, z is the same shape as x
- **Tensor Product** (aka dot product)
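In base R, the vector dot product and the matrix (tensor) product look like:

``` r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
sum(x * y)                 # dot product of two vectors
#> [1] 32

A <- matrix(1:6, nrow = 2) # shape (2, 3)
B <- matrix(1:6, nrow = 3) # shape (3, 2)
A %*% B                    # matrix product, shape (2, 2)
#>      [,1] [,2]
#> [1,]   22   49
#> [2,]   28   64
```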
- **Tensor Reshaping** means rearranging its rows and columns to match a target shape
- A **gradient** is the derivative of a tensor operation (or tensor function). Gradients are just the generalization of the concept of derivatives to functions that take tensors as inputs.\
$$
\begin{align}
&w = f(x, y, z) \\
&\nabla f(x, y, z) = \left[ \frac{\partial w}{\partial x}\: \frac{\partial w}{\partial y}\: \frac{\partial w}{\partial z}\right]
\end{align}
$$
- Another way of saying that is “If you added 1 to x before plugging it into the function, this is how much w would change, if the function was a straight line”
- The gradient of a tensor function represents the curvature of the multidimensional surface described by the function
- For a specific point, the gradient is a vector that points in the direction of the biggest increase in the function, or equivalently, in the steepest uphill direction
- The derivative of the loss curve at the weight value of $W_0$ is `grad(loss_value, W_0)` and this is the *direction of steepest ascent* of the loss function around $W_0$.
- The gradient (multiplied by a step size) is subtracted from a point to move it downhill.
- You can reduce loss value by moving $W$ in the opposite direction from the gradient. This will put you lower on the loss curve.
- **Stochastic Gradient Descent (SGD)**
- Solving `grad(f(W), W) = 0` for W (i.e. points where derivative is 0) is intractable for real neural networks, where the number of parameters is never less than a few thousand and can often be several tens of millions
- Each weight (W) value is a dimension of the loss function space. There could be millions.
- Mini-Batch SGD
1. Draw a random batch of training samples, x, and corresponding targets, y_true.
2. Run the model on x to obtain predictions, y_pred (this is called the **forward pass**).
3. Compute the loss of the model on the batch, a measure of the mismatch between y_pred and y_true.
4. Compute the gradient of the loss with regard to the model’s parameters (this is called the **backward pass**).
5. Move the parameters a little in the opposite direction from the gradient, e.g. `W <- W - (learning_rate * gradient)`, thus reducing the loss on the batch a bit. The learning rate (learning_rate here) is a scalar factor modulating the "speed" of the gradient descent process.
- **Optimizers** (aka optimization methods) are variants of SGD which differ in how they execute the weights update (last step)
- **Momentum**
- Addresses two issues with SGD: convergence speed and local minima
- It updates the parameter w based not only on the current gradient value but also on the previous parameter update
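A toy illustration of these update rules in base R, using the made-up one-parameter loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3):

``` r
grad <- function(w) 2 * (w - 3)  # analytic gradient of the toy loss

# Plain SGD: w <- w - learning_rate * gradient
w <- 0; lr <- 0.1
for (step in 1:50) w <- w - lr * grad(w)
w  # close to the minimum at w = 3

# SGD with momentum: the update also carries part of the previous update
w <- 0; velocity <- 0; momentum <- 0.9
for (step in 1:50) {
  velocity <- momentum * velocity - lr * grad(w)
  w <- w + velocity
}
w
```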
- **Backpropagation**
- A way to use the derivatives of simple operations (such as addition, relu, or tensor product) to easily compute the gradient of arbitrarily complex combinations of these atomic operations
- Backpropagation is the application of the chain rule to a computation graph
- Example: For a small computation graph such as `x1 = w * x`, `x2 = x1 + b`, `loss_val = abs(y_true - x2)`, with an input of `x = 2`, the chain rule multiplies the local derivatives along the path from the loss back to each parameter:
``` r
grad(loss_val, w) = grad(loss_val, x2) * grad(x2, x1) * grad(x1, w) = 1 * 1 * 2 = 2
grad(loss_val, b) = grad(loss_val, x2) * grad(x2, b) = 1 * 1 = 1
```
- If there are multiple paths linking the two nodes of interest, a and b, `grad(b, a)` is obtained by summing the contributions of all the paths
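A handy sanity check for any backpropagated gradient is a finite-difference estimate; a sketch with made-up values for `loss(w) = relu(w * x + b)`:

``` r
x <- 2; b <- 1
loss <- function(w) max(w * x + b, 0)  # relu(w * x + b)

# Chain rule: for w * x + b > 0, d loss / d w = 1 * x = 2
w <- 3
analytic <- x

# Central finite difference approximates the same derivative
eps <- 1e-6
numeric <- (loss(w + eps) - loss(w - eps)) / (2 * eps)
c(analytic, numeric)
#> [1] 2 2
```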
### Tensors
- Tensors are a generalization of matrices to an arbitrary number of dimensions (note that in the context of tensors, a *dimension* is often called an **axis**)
- `base::array` is a tensor
- In deep learning, you’ll generally manipulate tensors with ranks 0 to 4, although you may go up to 5 if you process video data.
- Weight tensors, which are attributes of the layers, are where the knowledge of the model persists.
- Components
- Number of axes (rank) - `length(dim(train_images))`
- Shape - `dim(train_images)`
- This is an integer vector giving the tensor's extent (number of elements) along each axis
- Data type - `typeof(train_images)`
- R’s built-in data types like double and integer
- Other tensor implementations also provide support for types like float16, float32, float64 (corresponding to R's double), int32 (R's integer type), etc.
- String
- Types
- Scalars (rank 0 tensors, 0 axis) (an R vector of length 1 is conceptually similar to a scalar)
- Vectors (rank 1 tensors, 1 axis)
``` r
x <- as.array(c(12, 3, 6, 14, 7))
str(x)
#> num [1:5(1d)] 12 3 6 14 7
```
- Matrices (rank 2 tensors, 2 axes)
``` r
x <- array(seq(3 * 5), dim = c(3, 5))
x
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 1 4 7 10 13
#> [2,] 2 5 8 11 14
#> [3,] 3 6 9 12 15
dim(x)
#> [1] 3 5
```
- Entries from the *first axis* are called the rows, and entries from the *second axis* are called the columns
- Cubes (rank 3, 3 axes or a stack of rank 2 tensors)
``` r
x <- array(seq(2 * 3 * 4), dim = c(2, 3, 4))
str(x)
#> int [1:2, 1:3, 1:4] 1 2 3 4 5 6 7 8 9 10 ...
length(dim(x))
#> [1] 3
```
- A rank 4 tensor is a stack of rank 3 tensors
- Data Types and Their Ranks
- *Vector data*: Rank 2 tensors of shape (samples, features), where each sample is a vector of numerical attributes (“features”)
- e.g. A dataset with 3 features (age, income, gender) and 10,000 observations would have the shape (10000, 3)
- e.g. A dataset of 500 text documents where each document has been encoded into a vector with 20,000 values. This dataset would have the shape (500, 20000)
- *Times-series data* or *sequence data*: Rank 3 tensors of shape (samples, timesteps, features), where each sample is a sequence (of length timesteps) of feature vectors
- Each sample is a matrix (rank 2, timesteps by features). All the samples make up the rank 3 tensor.
- e.g. Stock data where the current, lowest, and highest stock price is recorded every minute. There are 390 minutes in a trading day and the dataset has 250 days of data. The shape would be (250, 390, 3)
- *Images*: Rank 4 tensors of shape (samples, height, width, channels), where each sample is a 2D grid of pixels, and each pixel is represented by a vector of values (“channels”)
- e.g. A batch of 128 grayscale images of size 256 × 256 could thus be stored in a tensor of shape (128, 256, 256, 1). Grayscale has a single color channel.
- e.g. A batch of 128 color images could be stored in a tensor of shape (128, 256, 256, 3). Color images have 3 color channels (R,G,B)
- *Video*: Rank 5 tensors of shape (samples, frames, height, width, channels), where each sample is a sequence (of length frames) of images
- e.g A 60-second, 144 × 256 YouTube video clip sampled at 4 frames per second would have 240 frames. A batch of four such video clips would be stored in a tensor of shape (4, 240, 144, 256, 3)
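These shapes are easy to verify by constructing (small) arrays of the stated dimensions in base R; the stock example works out to shape (250, 390, 3):

``` r
vector_data <- array(0, dim = c(10000, 3))     # rank 2: (samples, features)
timeseries  <- array(0, dim = c(250, 390, 3))  # rank 3: (samples, timesteps, features)
images      <- array(0, dim = c(8, 28, 28, 1)) # rank 4: (samples, height, width, channels)

length(dim(timeseries))  # rank = number of axes
#> [1] 3
```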
- Operations
- Attributes: `ndim`, `shape`, `dtype`
- Example: dtype
``` r
r_array <- array(1)
typeof(r_array)
#> [1] "double"
as_tensor(r_array)$dtype
#> tf.float64
as_tensor(r_array, dtype = "float32")
#> tf.Tensor([1.], shape=(1), dtype=float32)
```
- R only has one integer type, and it gets converted to int32 when an array is coerced to a tensor
- R's double type is converted to float64 when an array is coerced to a tensor (pass [dtype = "float32"]{.arg-text} to `as_tensor` to get float32)
- Coercing to a tensor type
``` r
r_array <- array(1:6, c(2, 3))
r_array
#> [,1] [,2] [,3]
#> [1,] 1 3 5
#> [2,] 2 4 6
as_tensor(r_array)
#> tf.Tensor(
#> [[1 3 5]
#> [2 4 6]], shape=(2, 3), dtype=int32)
```
- Reshaping
``` r
x <- array(1:6)
x
#> [1] 1 2 3 4 5 6
array_reshape(x, dim = c(3, 2))
#> [,1] [,2]
#> [1,] 1 2
#> [2,] 3 4
#> [3,] 5 6
array_reshape(x, dim = c(2, 3))
#> [,1] [,2] [,3]
#> [1,] 1 2 3
#> [2,] 4 5 6
array(1:6, dim = c(3, 2))
#> [,1] [,2]
#> [1,] 1 4
#> [2,] 2 5
#> [3,] 3 6
# leaving out an axis specification
array_reshape(1:6, c(-1, 3))
#> [,1] [,2] [,3]
#> [1,] 1 2 3
#> [2,] 4 5 6
as_tensor(1:6, shape = c(NA, 3))
#> tf.Tensor(
#> [[1 2 3]
#> [4 5 6]], shape=(2, 3), dtype=int32)
```
- When reshaping a vector into an array, the values get placed rowwise (aka C ordering), but when creating a vector of the same shape, the values get placed columnwise (aka Fortran ordering).
- [order = "F"]{.arg-text} for `array_reshape` says to use Fortran ordering
- To let one axis be inferred from the other, pass a -1 or NA for the axis you want inferred
- Tensor Slicing: Selecting specific elements (subsetting) in a tensor
- Select images 10 through 99
``` r
my_slice <- train_images[10:99, , ]
#or
my_slice <- train_images[10:99, all_dims()]
dim(my_slice)
#> [1] 90 28 28
```
- `all_dims` can be useful for larger dim arrays
- Allowed to slice along up to two axes at one time
``` r
my_slice <- train_images[, 15:28, 15:28]
dim(my_slice)
#> [1] 60000 14 14
```
- Selects a 14 × 14 pixel area in the bottom-right corner of all images
- Leaving the end of a range to be inferred
``` r
train_images <- as_tensor(dataset_mnist()$train$x)
my_slice <- train_images[, 15:NA, 15:NA]
```
- Using Negative Indices
``` r
my_slice <- train_images[, 8:-8, 8:-8]
#> Warning:
#> Negative numbers are interpreted python-style
#> when subsetting tensorflow tensors.
#> See ?`[.tensorflow.tensor` for details.
#> To turn off this warning,
#> set `options(tensorflow.extract.warn_negatives_pythonic = FALSE)`
```
- The end of the selection is 8 indices from the end
- Changing cell values
- Arrays
``` r
x <- array(1, dim = c(2, 2))
x[1, 1] <- 0
```
- Tensors
- Data
``` r
v <- tf$Variable(initial_value = tf$random$normal(shape(3, 1)))
v
#> <tf.Variable 'Variable:0' shape=(3, 1) dtype=float32, numpy=
#> array([[-1.1629326 ],
#> [ 0.53641343],
#> [ 1.4736737 ]], dtype=float32)>
```
- v is an instance of the Variable class, which makes its state mutable.
- Change all values with a tensor
``` r
v$assign(tf$ones(shape(3, 1)))
#> <tf.Variable 'UnreadVariable' shape=(3, 1) dtype=float32, numpy=
#> array([[1.],
#> [1.],
#> [1.]], dtype=float32)>
```
- Change a cell
``` r
v[1, 1]$assign(3)
#> <tf.Variable 'UnreadVariable' shape=(3, 1) dtype=float32, numpy=
#> array([[3.],
#> [1.],
#> [1.]], dtype=float32)>
```
- Add or subtract a value
``` r
v$assign_add(tf$ones(shape(3, 1)))
#> <tf.Variable 'UnreadVariable' shape=(3, 1) dtype=float32, numpy=
#> array([[4.],
#> [2.],
#> [2.]], dtype=float32)>
```
- assign_sub for subtraction
- These are efficient alternatives to `x <- x + value` and `x <- x - value`
- Most R transformations and generics work with tensors (see pg 75)
- [Example]{.ribbon-highlight}: Arithmetic
``` r
a <- tf$ones(c(2L, 2L))
b <- tf$square(a)
c <- tf$sqrt(a)
d <- b + c
# matrix multiplication
e <- tf$matmul(a, b)
# element-wise multiplication
e <- e * d
```
- [Example]{.ribbon-highlight}: 1-index vs 0-index for axes
``` r
m <- as_tensor(1:12, shape = c(3, 4))
mean(m, axis = 1, keepdims = TRUE)
#> tf.Tensor([[5 6 7 8]], shape=(1, 4), dtype=int32)
tf$reduce_mean(m, axis = 0L, keepdims = TRUE)
#> tf.Tensor([[5 6 7 8]], shape=(1, 4), dtype=int32)
```
- Each calculates the column means
- TensorFlow's built-in functions are 0-indexed, like Python
### Examples
- Example: Basic MNIST Classification
``` r
library(tensorflow)
library(keras3)
mnist <- dataset_mnist()
train_images <- mnist$train$x
train_labels <- mnist$train$y
test_images <- mnist$test$x
test_labels <- mnist$test$y
train_images <- array_reshape(train_images, c(60000, 28 * 28))
train_images <- train_images / 255
test_images <- array_reshape(test_images, c(10000, 28 * 28))
test_images <- test_images / 255
model <-
keras_model_sequential(layers = list(
layer_dense(units = 512, activation = "relu"),
layer_dense(units = 10, activation = "softmax")
))
compile(
model,
optimizer = "rmsprop",
loss = "sparse_categorical_crossentropy",
metrics = "accuracy"
)
fit(
model,
train_images,
train_labels,
epochs = 5,
batch_size = 128
)
#> Epoch 1/5
#> 60000/60000 [===========================] - 5s - loss: 0.2524 - acc: 0.9273
#> Epoch 2/5
#> 51328/60000 [=====================>.....] - ETA: 1s - loss: 0.1035 - acc: 0.9692
test_digits <- test_images[1:10, ]
predictions <- predict(model, test_digits)
str(predictions)
#> num [1:10, 1:10] 3.10e-09 3.53e-11 2.55e-07 1.00 8.54e-07 ...
predictions[1, ]
#> [1] 3.103298e-09 1.175280e-10 1.060593e-06 4.761311e-05 4.189971e-12
#> [6] 4.062199e-08 5.244305e-16 9.999473e-01 2.753219e-07 3.826783e-06
# max probability index for first image
which.max(predictions[1, ])
#> [1] 8
# max probability value for first image
predictions[1, 8]
#> [1] 0.9999473
# corresponding test label of first image
test_labels[1]
#> [1] 7
metrics <-
evaluate(model,
test_images,
test_labels)
metrics["accuracy"]
#> accuracy
#> 0.9795
```
- Training Data - A stack of 60,000 matrices of 28 × 28 integers. Each matrix is a grayscale image with pixel intensity values between 0 and 255
- Preprocessing
- Scale all values to be in the \[0, 1\] interval
- Previously, training image values were in the \[0, 255\] interval.
- Reshape the arrays into the shape the model expects and convert them to type double
- Previously, training images were stored in a triple array of shape (60000, 28, 28) of type integer and the test images were in a triple array of shape (10000, 28, 28) of type integer
- Layers
- 2 dense layers in total
- Final Layer - 10-way softmax classification layer, which returns an array of 10 probability scores (summing to 1)
- Compilation Components:
- Optimizer - The mechanism through which the model will update itself based on the training data it sees, so as to improve its performance.
- Loss Function - How the model will be able to measure its performance on the training data, and thus how it will be able to steer itself in the right direction.
- Metrics - Monitored during training and testing. Here, we care only about accuracy (the fraction of the images that were correctly classified).
- Fitting the model
- For each training epoch, the time elapsed, loss value, and metric value are shown
- Process
- The model will start to iterate on the training data in mini-batches of 128 samples, five times over (each iteration over all the training data is called an epoch).
- For each batch, the model will compute the gradient of the loss with regard to the weights (using the backpropagation algorithm, which derives from the chain rule in calculus) and move the weights in the direction that will reduce the value of the loss for this batch.
- After these five epochs, the model will have performed 2,345 gradient updates (469 per epoch), and the loss of the model will be sufficiently low that the model will be capable of classifying handwritten digits with high accuracy.
- The last training epoch's (not shown) accuracy is 0.989 (98.9%)
- The batch size of 128 means the model processes 128 samples at a time
``` r
batch_1 <- train_images[1:128, , ]
batch_2 <- train_images[129:256, , ]
n <- 3
batch_n <- train_images[seq(to = 128 * n, length.out = 128), , ]
```
- Predicting on the test set
- 10 rows are subsetted for testing
- A 2-dim array is returned where the number at index i of a row (e.g. predictions\[1, \]) is the probability that the corresponding test image (test_digits\[1, \]) belongs to class i
- i.e. the first row of predictions holds the probabilities that the first test image is each of the labels (digits 0-9)
- e.g. there's a 9.999473e-01 (99.99%) chance that the first test image is a "7"
- Evaluation
- The test set accuracy turns out to be 97.9%, quite a bit lower than the training set accuracy (98.9%), which is a sign of overfitting
- View an image
``` r
digit <- train_images[5, , ]
plot(as.raster(abs(255 - digit), max = 255))
```
### TensorFlow
- Generate tensors of 1s or 0s
``` r
tf$ones(shape(1, 3))
#> tf.Tensor([[1. 1. 1.]], shape=(1, 3), dtype=float32)
tf$ones(c(2L, 1L))
#> tf.Tensor(
#> [[1.]
#> [1.]], shape=(2, 1), dtype=float32)
tf$zeros(shape(1, 3))
#> tf.Tensor([[0. 0. 0.]], shape=(1, 3), dtype=float32)
```
- Generate random tensors from a distribution
``` r
tf$random$normal(shape(1, 3), mean = 0, stddev = 1)
#> tf.Tensor([[ 0.79165614 0.35886717 0.13686056]], shape=(1, 3), dtype=float32)
tf$random$uniform(shape(1, 3))
#> tf.Tensor([[0.93715847 0.67879045 0.60081327]], shape=(1, 3), dtype=float32)
```
- Classes
- `Variable` is a specific kind of tensor meant to hold mutable state — for instance, the weights of a neural network.
- Process
``` r
x <- tf$Variable(0) # <1>
with(tf$GradientTape() %as% tape, { # <2>
y <- 2 * x + 3 # <3>
})
grad_of_y_wrt_x <- tape$gradient(y, x) # <4>
```
1. Instantiate a scalar `Variable` with an initial value of 0
2. Open a `GradientTape` scope
3. Inside the scope, apply some tensor operations to our variable
4. Use the tape to retrieve the gradient of the output [y]{.var-text} with respect to our variable [x]{.var-text}
- Example
``` r
W <- tf$Variable(random_array(c(2, 2))) # <1>
b <- tf$Variable(array(0, dim = c(2))) # <2>
x <- random_array(c(2, 2)) # <3>
with(tf$GradientTape() %as% tape, {
y <- tf$matmul(x, W) + b # <4>
})
grad_of_y_wrt_W_and_b <- tape$gradient(y, list(W, b))
str(grad_of_y_wrt_W_and_b) # <5>
#> List of 2
#> $ :<tf.Tensor: shape=(2, 2), dtype=float64, numpy=…>
#> $ :<tf.Tensor: shape=(2), dtype=float64, numpy=array([2., 2.])>
```
1. Instantiated a 2x2 array (matrix) with random values stored in a mutable `Variable` class
2. Instantiated a length 2 vector with 0 values
3. Created a 2x2 array (matrix) with random values
4. `matmul` performs a dot product
5. [grad_of_y_wrt_W_and_b]{.var-text} is a list of two tensors with the same shapes as [W]{.var-text} and [b]{.var-text}, respectively
- `GradientTape`
- It’s a context manager that will “record” the tensor operations that run inside its scope, in the form of a computation graph (sometimes called a “tape”).
- This graph can then be used to retrieve the gradient of any output with respect to any variable or set of variables
### Keras
- `compile` - Method that configures the training process
- Arguments
- [loss]{.arg-text} - The loss function (objective function). The quantity that will be minimized during training. It represents a measure of success for the task at hand.
- Available Loss Functions: `ls(pattern = "^loss_", "package:keras")`
- [optimizer]{.arg-text} - Determines how the network will be updated based on the loss function. It implements a specific variant of stochastic gradient descent (SGD).
- Available Optimizers: `ls(pattern = "^optimizer_", "package:keras")`
- [metrics]{.arg-text} - The measures of success you want to monitor during training and validation, such as classification accuracy. Unlike the loss, training will not optimize directly for these metrics. As such, metrics don’t need to be differentiable.
- Available Metrics: `ls(pattern = "^metric_", "package:keras")`
- Using optimizer defaults and available losses and metrics
``` r
model <-
keras_model_sequential() |>
layer_dense(1)
model |>
compile(optimizer = "rmsprop",
loss = "mean_squared_error",
metrics = "accuracy")
```
- Specifying optimizer options and functions for custom losses and metrics
``` r
model |>
compile(
optimizer = optimizer_rmsprop(learning_rate = 1e-4),
loss = my_custom_loss,
metrics = c(my_custom_metric_1, my_custom_metric_2)
)
```
- `fit` - Implements the training loop
- Arguments
- [data]{.arg-text} - The inputs and targets to train on.
- Input Types: R arrays, tensors, or a TensorFlow Dataset object.
- [epochs]{.arg-text} - How many times the training loop should iterate over the data passed.
- [batch_size]{.arg-text} - The batch size to use within each epoch of mini-batch gradient descent: the number of training examples considered to compute the gradients for one weight update step.
- The return object contains a metrics property, which is a named list of their per-epoch values for "loss" and specific metric names
``` r
str(model_fitted$metrics)
#> List of 2
#> $ loss : num [1:5] 14.2 13.6 13.1 12.6 12.1
#> $ binary_accuracy: num [1:5] 0.55 0.552 0.554 0.557 0.559
```
## TensorFlow Install
- Notes from Video: [Install TensorFlow with GPU Support using WSL2 on Windows 11](https://www.youtube.com/watch?v=402DciWGvt8)
- See
- [System and Software Requirements](https://www.tensorflow.org/install/pip#windows-wsl2)
- Make sure your hardware and software are supported
- On Windows (not WSL2)
- Check NVIDIA driver version
- Never install the NVIDIA Linux driver in WSL2. You only need the Windows driver.
- In powershell (not WSL2)
``` powershell
nvidia-smi
```
- I did not check or do anything about these items and my set-up works fine:
- CUDA® Toolkit 12.3
- cuDNN SDK 8.9.7
- *(Optional)* TensorRT to improve latency and throughput for inference
- Install the Latest Microsoft Visual C++ Redistributable version
- I don't think this is required for WSL2 installations, but the video guy installed it, so I did too.
- See [Misc \>\> Update Python \>\> Linux](misc.qmd#sec-misc-updatepy-lin){style="color: green"} to update Python on WSL2 (do this only at user level, not root)
- Python
- Make sure the pre-installed version of Python (or the version you'd prefer to use) and pip on your Ubuntu version are supported.
- Check pip
``` bash
# See which pip you're using
which pip
pip --version
# See where packages would be installed
python3.12 -m site --user-site
```
- Install pip
``` bash
# Install pip for Python 3.12
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.12
sudo apt install python3-pip
```
- Process
- Set-up the venv and install a few base packages
``` bash
sudo apt update && sudo apt upgrade -y
# Create a dedicated ML environment
python3.12 -m venv tf_env
source tf_env/bin/activate
pip install --upgrade pip setuptools wheel
```
- setuptools - Handles build/distribution functionality
- Inside the venv:
- Install TensorFlow
``` bash
pip install tensorflow[and-cuda]
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
#> [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```
- It will download and install a bunch of packages including CUDA packages, keras, and tensorflow. The tensorflow package takes a relatively long time to install (~5 min), to the point that it makes you question whether everything is okay (it is).
- The second command prints some additional diagnostic output, but the last line is what's important. It should have "GPU" in it.
- Install JupyterLab and ipykernel
``` bash
pip install jupyterlab
```
- Needed to run notebooks
- Register your venv with jupyter
``` bash
python -m ipykernel install --user --name=tf_env --display-name="Python (tf_env)"
```
- Allows jupyter to use your venv
- You may need to install ipykernel first: `pip install ipykernel`
- If you already have VSCode and a jupyter server running, you'll need to close VSCode. Then, in WSL2 while in your venv, cd to your project directory and type `code .`
- This will open VSCode in that directory and connect to WSL.
- VSCode
- After installing the WSL extension (microsoft), hit ctrl + shift + p, and type "wsl". Then select Connect to WSL
- In the bottom left corner, it'll say, "WSL Ubuntu-22.04"
- In WSL2 venv, navigate to the directory that you want to do work in.
``` bash
cd Documents/Python/Projects/deeplearning-with-r/
```
- In VSCode, hit ctrl + k, ctrl + o to open a folder. Copy the path from WSL2 where you've navigated to the project directory and paste in the VSCode window
```
/mnt/c/Users/user/Documents/Python/Projects/deeplearning-with-r
```
- Hit ctrl + shift + p, select "Create New Jupyter Notebook". It will ask you to install the Microsoft Python Extension inside WSL2. Click OK and install it. Then close the extension tab and go back to the notebook.
- Type something and ctrl + s to save it in the project folder
``` python
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
```
- To run the cell, choose your venv kernel
- In the upper right of the notebook, Click "select kernel", Click "Select Environment", and finally select your venv.
- R
- Misc
- Option: Install RStudio Server in WSL2
- See [Using RStudio Server in Windows WSL2](https://support.posit.co/hc/en-us/articles/360049776974-Using-RStudio-Server-in-Windows-WSL2)
- You'll then use RStudio through your browser
- Inside RStudio,
``` r
install.packages("languageserver")
remotes::install_github("nx10/httpgd")
```
- These need to be installed for working outside of WSL2, but I'm going to go ahead and include them here, too.
- Later on, [{languageserver}]{style="color: #990000"} also gets installed within WSL2
- [{httpgd}]{style="color: #990000"} is a graphics device; it isn't required but is recommended (I think)
- Using Eddelbuettel's script to install R and add r2u CRAN repo for ubuntu R package binaries
``` bash
# switch to root
sudo su
wget -qO- https://raw.githubusercontent.com/eddelbuettel/r2u/master/inst/scripts/add_cranapt_jammy.sh | bash
# go back to user
exit
```
- It follows CRAN's [Instructions](https://cran.r-project.org/bin/linux/ubuntu/) for installing R on Ubuntu except it doesn't install some libs at the beginning. Everything worked out though, so I guess Eddelbuettel knew what he was doing.
- Inside the activated virtual environment (e.g. [tf_env]{.var-text})
- Start R and install packages and register the R kernel with Jupyter.
``` r
install.packages(c("languageserver", "IRkernel", "reticulate"))
install.packages("tensorflow") # make sure this installs after those
install.packages("keras3") # keras was deprecated
IRkernel::installspec(user = TRUE)
```
- For installing packages, it doesn't matter if you're in the virtual environment, but for registering the kernel it does. Since jupyterlab was installed in the venv, R needs to start inside the venv in order to find it (hierarchical scoping).
- You won't get any output after registering, but you can check that the kernel got registered from bash with `jupyter kernelspec list`
- There should be a kernel named "ir". It'll just be named "R" inside of Jupyter.
- [user = TRUE]{.arg-text} says register it at the user level and not at root.
- You can register at the root level, but you would need to have started R with `sudo R`.
- Make sure the WSL extension is installed in VSCode/Positron.
- Goto your desired project directory, and open up VSCode/Positron, `code .` / `positron .` .
- Inside VSCode/Positron:
- The WSL extension will have automatically connected to WSL2. Hit `ctrl + shift + p`, select "Create New Jupyter Notebook".
- In the upper right of the notebook, Click "Select Kernel", Click "Select Jupyter Kernel", and finally select "R."
- A pop-up (bottom-right) will ask you to install the R extension inside WSL2. Go ahead and do it.
- Inside a cell, check that GPU is available ([Docs](https://tensorflow.rstudio.com/install/custom))
``` r
library(tensorflow)
use_virtualenv("/mnt/c/Users/user/tf_env")
tf$config$list_physical_devices("GPU")
#> [[1]]
#> PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
```
- Go to File \>\> Close Remote Connection to go back to non-WSL VSCode.