assert not (num_residual_streams <= 1 and neural_memory_qkv_receives_diff_views), 'allowing neural memory queries / keys / values to be derived from different combinations of the residual streams only works if hyper connections has more than 1 residual stream'
num_layer_choices = (layer - 1) * 4 + 1  # for each layer, have memory input select from attn inp, attn out, ff inp, and ff out - plus one for the current point in the residual stream (memory input)
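For concreteness, here is a rough sketch of how those candidate views could be gathered so the count matches `num_layer_choices`. This is purely illustrative and not the repo's actual bookkeeping; the helper name `collect_view_candidates` and the `hiddens_per_layer` structure are made up for this example.

```python
# Illustrative only: each earlier layer contributes 4 hiddens (attn input, attn output,
# ff input, ff output), plus 1 for the current point in the residual stream.

def collect_view_candidates(hiddens_per_layer, current_stream, layer):
    # hiddens_per_layer: list of dicts for layers 1 .. layer-1, e.g.
    #   {'attn_inp': t, 'attn_out': t, 'ff_inp': t, 'ff_out': t}
    candidates = [current_stream]

    for h in hiddens_per_layer[:layer - 1]:
        candidates += [h['attn_inp'], h['attn_out'], h['ff_inp'], h['ff_out']]

    num_layer_choices = (layer - 1) * 4 + 1
    assert len(candidates) == num_layer_choices
    return candidates
```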
train_mac.py (2 additions & 0 deletions)
@@ -48,6 +48,7 @@
 STORE_ATTN_POOL_CHUNKS = True              # whether to use attention pooling for chunk derived momentum, per-layer lr mod, decay
 MEMORY_MODEL_PER_LAYER_LEARNED_LR = True
 NEURAL_MEM_WEIGHT_RESIDUAL = True          # learning to accept contributions from the weights of the previous neural mem layer brings about significant improvements. this was improvised and not in the paper, but inspired by the value residual learning free lunch paper
+NEURAL_MEM_QKV_RECEIVES_DIFF_VIEW = True   # allows the neural memory to select which layers to derive queries / keys / values from, effectively letting it graft itself onto the transformer in whichever way is beneficial. this addresses an issue raised by a phd student who noted that the mem network is learning nothing more than wk @ wv. it also generalizes all possible ways to connect the neural memory to a transformer, a sort of NAS
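To make the "select which views feed queries / keys / values" idea concrete, here is a minimal sketch of one way such a selection could be learned: each of the q / k / v branches gets a softmax weighting over the candidate views, which is a soft, differentiable version of the NAS-like selection the comment alludes to. This `QKVViewSelector` module is not part of titans-pytorch, and the actual wiring in mac_transformer.py may differ; it is only an assumption-laden illustration.

```python
import torch
from torch import nn

class QKVViewSelector(nn.Module):
    # One learned softmax weighting over the candidate views per branch (q, k, v),
    # so the memory could, e.g., read queries from the current residual stream while
    # taking keys / values from an earlier layer's attention output.
    def __init__(self, num_choices):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3, num_choices))  # 3 branches: q, k, v

    def forward(self, views):
        # views: (num_choices, batch, seq, dim) - stacked candidate hiddens
        weights = self.logits.softmax(dim = -1)                   # (3, num_choices)
        mixed = torch.einsum('c ..., g c -> g ...', views, weights)
        return mixed[0], mixed[1], mixed[2]                       # inputs for q, k, v

# usage sketch: a memory at layer 5 with 4 prior layers has (5 - 1) * 4 + 1 = 17 candidate views
# selector = QKVViewSelector(num_choices = 17)
# q_inp, k_inp, v_inp = selector(torch.stack(candidates))
```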