Transcript of what I meant to say in presenting the slides here:
recognizing-patterns.pdf AKA
https://github.com/opencog/learn/blob/master/learn-lang-diary/recognizing-patterns.pdf
A recording of the presentation is here:
https://www.youtube.com/watch?v=PLSjqHtyw3E
Title page:
* Hi, my name is Linas Vepstas; you can call me Linas.
* I want to tell you about a theory of grammar that works not only
for natural language, but also, I believe, for audio, video, or
any sensory input. And not just low-level stuff, but also for high-level
functions, such as common-sense reasoning and natural deduction.
* not only that, but it is possible to write algorithms that learn
this grammar in an unsupervised setting: without any prior
human-selected inputs.
* As a bonus, the theory is symbolic, not sub-symbolic, and as such,
it provides for explainable, comprehensible patterns, thus offering
an answer to the question of understanding how it thinks.
XXX alt text
* I'm here to talk about the automated, unsupervised learning
of explainable, comprehensible patterns in language,
and in a general setting that includes audio and video,
or anything -- the noosphere.
Slide 1:
* I'm going to start by making some outrageous claims.
* I'm going to claim that the system being presented here is capable of
solving the famous Frame Problem and the Symbol Grounding Problem.
* That common-sense reasoning can be learned.
* Not only are these claims the holy grail of AGI, which is why they are
outrageous,
* But I will present an algorithm that is not particularly complicated.
* It consists of some fairly well-known parts that many others have
fooled with over the decades.
* It's not in uncharted territory.
* In this sense, most of what I will say has been said before.
* It might even seem obvious, or simplistic to you, or even naive.
* I'd like to ask for your attention nonetheless, as you may find
the particular combination here surprising and unusual.
* Bear with me. I'll start out simple.
Everything in the universe is a graph
* This is the universe not just of physical stars and planets,
but also the universe of thoughts and concepts, the noosphere.
* Sparse graphs are necessarily labeled, as otherwise there is no
way of talking about the things in the graph.
* The edges are necessarily labelled by the vertices at their
end-points.
Graphs are decomposable
* This should be obvious, but is rarely mentioned in graph theory.
* You can cut an edge to get two half-edges.
* After cutting the edge, you need to label the cut ends so that
you know how to join them again.
* The metaphor I will use repeatedly is that of the jigsaw puzzle-piece
connector.
* Jigsaw puzzle pieces are designed so that the mating
of the pieces is forced.
* When you cut an edge, attach a connector to it.
* In this example, polarity is indeterminate; flip a coin.
* In the grand scheme of things, you will discover that connectors
can be multi-polar, multi-sexual.
* This is perhaps politically controversial, but biologists have found
fungi with 43 different types of sex.
* Some of these can mate, and produce viable offspring, and some cannot.
It's quite strange.
* I've drawn a picture of a telescope jigsaw-puzzle piece, with
connectors for looking at the sun and the moon.
* There are other things that can see the sun and the moon: eyes and
lenses. They have the same jigsaw shape.
* These shapes define the syntax of the graph.
* The syntax is exactly that collection of shapes that allow connections
to occur.
* This might not be the definition of syntax that you are comfortable
with, or that you have learned.
* It is, however, entirely equivalent to the more conventional
definitions you may know.
* This is not a particularly deep claim; it's fairly well-known and
well-understood. I'm only sorry that I don't have a good reference for it
at my fingertips. (A small code sketch of the connector idea follows below.)
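
To make the half-edge idea concrete, here is a minimal Python sketch. It is
not the project's actual code; the names (Connector, Piece, the "LOOKS-AT"
type) are made up for illustration. The point is only that a cut edge leaves
two labelled connectors, and the mating rule is forced by the labels:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Connector:
        ctype: str      # the connector's type label, e.g. "S", "O", "LOOKS-AT"
        polarity: str   # "+" points one way along the cut edge, "-" the other

        def mates(self, other: "Connector") -> bool:
            # Mating is forced: same type label, opposite polarity.
            return self.ctype == other.ctype and self.polarity != other.polarity

    @dataclass(frozen=True)
    class Piece:
        label: str
        connectors: tuple

    telescope = Piece("telescope", (Connector("LOOKS-AT", "+"),))
    moon = Piece("moon", (Connector("LOOKS-AT", "-"),))
    print(telescope.connectors[0].mates(moon.connectors[0]))   # True: they can join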
Graphs are Compositional
* This is so basic, so obvious, that we are like fish in water:
we are not aware of the water.
That's why I have to say these things out loud.
* At the top of this chart, I've drawn a conventional definition of a
term algebra.
* A term algebra consists of a set of function symbols taking arguments.
The arguments can be constants, variables, or other terms.
* Plugging things together has a name: it's called "beta reduction".
Terms are always DAGs, directed acyclic graphs.
* People in symbolic AI and proof theory and programming and compilers
and chip design work with term algebras all the time. It's the stock
in trade.
* The only point here is that jigsaw connectors generalize the
concepts of term algebras. One can assemble things that aren't just
DAGs, but any directed or undirected graph, in general.
* The connectors are type theoretical types.
* Types have structure; types can have a very complex structure.
* Thus, perhaps the jigsaw diagram is misleading: that connector
looks so simple.
* It can, in fact, be quite complex.
* By the way: a type-theory type is the same thing as a type in computer
science.
* It can be an int, a float, a string: these are primitive types.
* It can be a compound type: a list, an OO class, a function signature.
* I'm drawing a jigsaw connector to represent any and all of these things.
* I warned you that this will all be simple, yet confusing at the same
time. Perhaps it begins here.
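
Here is a minimal Python sketch of terms-as-trees and of "plugging"
(beta reduction). All names are illustrative, not drawn from any
particular library:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Var:
        name: str

    @dataclass(frozen=True)
    class Term:
        head: str          # the function symbol
        args: tuple = ()   # constants, variables, or other terms

    def plug(expr, var, value):
        """Substitute `value` for every occurrence of the variable `var`."""
        if isinstance(expr, Var):
            return value if expr.name == var else expr
        return Term(expr.head, tuple(plug(a, var, value) for a in expr.args))

    # f(x, g(x, c)) with x := h(c) becomes f(h(c), g(h(c), c))
    t = Term("f", (Var("x"), Term("g", (Var("x"), Term("c")))))
    print(plug(t, "x", Term("h", (Term("c"),))))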
Slide physics:
* In the next two slides, I simply want to indicate that these ideas are
just absolutely everywhere.
* All over the place.
* Here's a quick tour.
* Anything tensorial.
* On the left: cobordisms, from string theory.
* On the right: natural language as a quantum algebra.
* The actual paper there is quite interesting.
* Coecke is saying that you can parse natural language quickly and
easily with quantum algorithms.
* Anyway: Composition is like contracting tensor indexes.
* Connecting connectors is like contracting tensor indexes.
Chemistry Biology
* Some examples from chemistry and biology.
* A chemical reaction has inputs and outputs.
* These are the "connectors", they must connect, for a reaction to
happen.
* Prusinkiewicz has been developing algorithmic botany for the last
three decades. It is beautiful and stunning and relevant to AGI.
You should check it out.
Link Grammar:
* My fetish for jigsaw connectors comes from Link Grammar.
* The diagram here is taken from the original 1991 LG paper.
That paper explicitly talks about jigsaw-puzzle pieces.
* Above, you can see some words with connectors.
* Below, a parse:
S for the subject-verb linkage
O for the object-verb linkage
D for connecting to the determiner, which English demands.
* The diagram drawn here is correct for this sentence, but
as a warning:
* I'm glossing over some more complex ideas in the connector
type matching system. As I said earlier, types need not be primitive
types; they can be compound types.
* A few quick side remarks: the LG grammar can be converted into HPSG,
into dependency grammars, into categorial grammars, functional
grammars.
* This can be done algorithmically.
* There are some eye-watering mathematical proofs of this stuff out there.
* If you haven't heard of LG before ... well, now you have. It's real.
(A tiny connector-matching sketch, in code, follows at the end of this section.)
* There are complete dictionaries for English and Russian.
* There are smaller demo dictionaries for ten more.
* It's legit, linguistically-speaking; hundreds of published papers.
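
To ground the S/O/D example above, here is a tiny hand-rolled Python sketch
of the connector-matching idea. It does not use the actual link-grammar
library or its dictionaries; the mini-lexicon and the "S+"/"O-" notation are
simplified for illustration:

    # Each word carries a disjunct: a list of typed connectors pointing
    # left ("-") or right ("+"). A link joins a "+" connector on the earlier
    # word to a "-" connector of the same type on a later word.
    lexicon = {
        "she":   ["S+"],          # wants to be a subject, linking rightward
        "threw": ["S-", "O+"],    # wants a subject on its left, an object on its right
        "the":   ["D+"],          # a determiner, linking rightward
        "ball":  ["D-", "O-"],    # wants a determiner, and wants to be an object
    }

    def can_link(left_conn: str, right_conn: str) -> bool:
        # e.g. "O+" on the earlier word mates with "O-" on the later word
        return (left_conn[:-1] == right_conn[:-1]
                and left_conn.endswith("+") and right_conn.endswith("-"))

    print(can_link(lexicon["threw"][1], lexicon["ball"][1]))  # True: O+ of "threw" mates O- of "ball"
    print(can_link(lexicon["she"][0], lexicon["ball"][1]))    # False: S+ does not mate O-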
Vision:
* Now for the good stuff.
* The jigsaw paradigm can be applied to the analysis of shapes and
colors.
* I've drawn the abstract graph for the corresponding image.
* There are connectors.
* Some connectors connect north-south directions
* Some connectors connect to colors
* Some connectors connect to round shapes.
* Some connectors connect to a background.
* So it's not just shape, or just color: some of these connections
are abstract.
* A key point: it is not about pixels. (!!)
* Shape grammars really are different from pixel collections.
Audio:
* We can do this for sound, too.
* This is a whale song from NOAA.
* It's a fast-Fourier-transform (FFT) spectrum.
* On the bottom left, you can see a sequence of seven regularly spaced
low frequency chirps.
* You know what a chirp is. Tweet.
* A rising or a falling pitch.
* The diagram on the right is a structural representation of the
first 10 seconds of the FFT spectrum.
* I've tried to draw the diagram that captures and extracts that
structure.
* AH HAH!
* But now we come to a key point.
* Where did those meaningful filters come from?
* How did we know to look for chirps?
* Surely it is not because some grad student hand-coded up a bunch of
filters just so.
* I mean, you could do that, but that is not the point of AGI.
* You want to discover these filters from scratch. From nothing at all.
* So how do you do that?
* How do you find structure from nothing?
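
Here is a minimal numpy sketch of the kind of front-end being described: a
short-time FFT over an audio signal, plus a crude check for a rising chirp.
In the talk, the whole point is that such filters should eventually be
discovered rather than hand-coded; this only shows the raw material they
operate on. All numbers are made up:

    import numpy as np

    fs = 4000                                            # sample rate, Hz
    t = np.arange(0, 2.0, 1.0 / fs)
    signal = np.sin(2 * np.pi * (50 * t + 40 * t ** 2))  # a rising chirp, 50 Hz and up

    def stft(x, nfft=256, hop=128):
        """Short-time FFT magnitude: rows are frequency bins, columns are time frames."""
        frames = [x[i:i + nfft] * np.hanning(nfft)
                  for i in range(0, len(x) - nfft, hop)]
        return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

    spec = stft(signal)
    peaks = spec.argmax(axis=0)                          # strongest frequency bin per frame
    print("peak bin drifts from", peaks[0], "to", peaks[-1])   # it rises, for a rising chirp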
Part Two
* Now we come to the dense, mathematical part of this talk.
* I will go fairly quickly, because many of these ideas are
well-known in the machine-learning industry.
* I will show how to build up a graph structure from pair-wise
correlations.
* Then I will show how to generalize from the particulars of
specific graphs.
* This generalization turns out to be isomorphic to clustering,
and to matrix factorization.
* You can use either clustering algorithms to do this, or matrix
factorization algorithms to do this.
* I bet you can even use neural nets to do this.
* No seriously! You can! No one has done it, but you can!
* There's gonna be a lot of neural-net fans at this conference;
I can't steal their thunder.
So, the details.
Lexical attraction
* Start with a frequentist approach to probability.
* Observe things, and count how often they occur.
* Observing pairs is the easiest.
* Blast through some large corpus of English text.
* Say, a million sentences.
* Count how often you see word pairs: words close to one-another.
* 2-grams is the fancy word for it.
* This gives you a count, call it N(u,w) of the word on the left and the
word on the right.
* Divide by the total number of pairs to get a probability,
a frequentist probability.
* The star here, the asterisk, is a wild-card.
* Divide the pair probability by the marginal probability of the left
word, and the marginal probability of the right word.
* Take the logarithm
* This is basic textbook MI
* I'll call it Lexical attraction because that's what Yuret called it
and because I need to distinguish from another MI, coming up.
* Otherwise, it's the same thing.
* The diagram here is from Yuret's thesis.
* What he did here is very clever:
He drew the maximum spanning tree.
* It is a tree that connects every word in the sentence,
and the edges are chosen so that the grand total MI is the largest
possible.
* Note that the LA/MI for "Northern Ireland" is huge. That's because
when these two words occur in a text, they commonly occur together.
* You don't much see "Northern" with that many other things, and
you don't see a lot of other kinds of Ireland.
* These two words occur together. That is what a high MI score tells
you.
* You can extract a grammar from this.
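
A minimal Python sketch of the counting step just described, under the
frequentist reading: count adjacent word pairs in a toy corpus and compute
MI(u,w) = log2[ P(u,w) / ( P(u,*) P(*,w) ) ]. In practice this runs over
millions of sentences with a wider window; the three-sentence corpus here is
only for illustration:

    import math
    from collections import Counter

    sentences = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "northern ireland is in the news",
    ]

    pair_counts, left_counts, right_counts = Counter(), Counter(), Counter()
    for s in sentences:
        words = s.split()
        for u, w in zip(words, words[1:]):     # adjacent pairs; a small window also works
            pair_counts[(u, w)] += 1
            left_counts[u] += 1
            right_counts[w] += 1

    total = sum(pair_counts.values())

    def lexical_attraction(u, w):
        """MI(u,w) = log2 [ P(u,w) / ( P(u,*) P(*,w) ) ]"""
        p_uw = pair_counts[(u, w)] / total
        return math.log2(p_uw / ((left_counts[u] / total) * (right_counts[w] / total)))

    print(lexical_attraction("northern", "ireland"))   # high: they only occur together
    print(lexical_attraction("the", "cat"))            # lower: "the" pairs with many things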
Lexical entries
* Here's how.
* You've drawn the maximum spanning tree.
* Now cut the edges, and form the labelled connectors.
* Here are several different notations.
* The graphical one, of course.
* A text version.
* The minus signs mean "connect to the left"
* The text version uses an ampersand to say you must have both
connectors.
* It turns out this ampersand is the conjunction from a fragment of
linear logic.
* It's not just some naive boolean-and; it's a bit more subtle.
Interesting.
* The next row is tensorial notation.
* Very popular among physicists and those with a formal mathematics
background.
* Tensorial notation is popular in quantum-influenced thinking;
it's very comfortable for them. A known thing.
* These notations are the same.
* Jigsaw puzzle connectors are tensorial, that's the message.
* Oh, note that the whole thing looks like a skip-gram.
If you're from the neural net world, you know skip-grams.
This is like that, but different. A hah!
* I write d for disjunct because the connector sequences are disjoined.
* By that I mean, you have a menu choice:
you can throw a ball, catch a ball, see a ball.
* But you can only use one of these words at a time in a sentence.
Choose one: choose an item from a menu.
* It's a disjunction.
* Now we do our frequentist thing again.
* Run through your corpus, a million sentences, and count
how often you see a word-disjunct pair.
* You get a count N(w,d)
* Turn the count into a probability.
* Given a word, write down the counts for that word.
* The result is something that looks like a vector.
* I've written it as a vector here.
* There's something a bit clever here: the plus sign can be interpreted
as literal, mathematical addition.
* But also as disjunction: the plus sign is a menu choice.
* To build a sentence you must pick one.
P(w,d) tells you how often to pick that choice.
Shades of wave-function collapse! OMG!
* hat-e is a unit basis vector.
* OK you've got a bunch of these. Now what?
* Generalize from the specific to the general.
(Skip a slide) I've got my slides in the wrong order.
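
A minimal Python sketch of the disjunct-counting step described above: given
a spanning-tree parse (hand-written here; in the real pipeline it comes from
the MST step), cut each edge, record for every word the connectors left
behind, and count the (word, disjunct) pairs N(w,d). Connector labels and
ordering conventions are simplified for illustration:

    from collections import Counter

    words = ["she", "threw", "the", "ball"]
    edges = [(0, 1), (1, 3), (2, 3)]   # spanning-tree links, as (left position, right position)

    disjunct_counts = Counter()
    for i, w in enumerate(words):
        connectors = []
        for a, b in edges:
            if a == i:
                connectors.append(words[b] + "+")   # link to a word on the right
            elif b == i:
                connectors.append(words[a] + "-")   # link to a word on the left
        disjunct_counts[(w, " & ".join(connectors))] += 1

    for (w, d), n in disjunct_counts.items():
        print(f"N({w!r}, {d!r}) = {n}")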
Learning a Lexis
* From "throw the ball" I want to generalize to "verb determiner noun"
How?
* By similarity.
(Back a slide)
I need a similarity score.
Similarity scores
* Cosine distance is commonly used in machine learning.
* It's not that good.
* The problem is that cosine distance was designed to be invariant in
Euclidean space. It's ideal for that. It's a Casimir invariant.
* The problem with assuming Euclidean space is that taking orthogonal
projections can lead to negative probabilities. It's pathological.
You can't just take the orthogonal complement; you get nonsense.
* You need something that works in probability space.
* Probability space is not Euclidean; it's a simplex:
a triangle, a tetrahedron in 3D, a something-gon in 4D...
* Anyway, experimental measurements show that cosine distance is low quality.
* MI is well-behaved under Markov (affine) transformations.
* As before, star means sum over everything
* Think of the marginals here as a weighted word probability.
* There's even a Feynman diagram for this: the marginals
are the vacuum contributions! So you can get quite fancy with the
theory, here, if you want.
* Now a brief diversion. I graphed the distribution of MI.
This graph is for 2/3rds of a million word pairs.
* It's on a log scale. It's a parabola.
* A parabola on a log scale is just a bell curve. A Gaussian. A normal
distribution.
* Why the heck would word pairs, from the English language, when pumped
through a grammar detector built from MST parses, form
a normal distribution?
* Really? Why? Who knows. I don't think anyone knows.
This gives a hint of how this entire topic is lacking in theoretical
foundations.
* I'd love to know why. My pet theory is that it's a GUE, but I can't yet
show this.
Anyway...
* So we are learning to generalize.
* The biggest problem with MI similarity is that very rare words often
have a very large MI.
* It is a poor strategy (I think!?) to start assigning rare words
to grammatical classes.
* You want to start with common words. But how?
* Well, here's a proposal: take the average probability of the two
words, take the log, and add that to the MI.
* Now, averaging means dividing by two.
* Inside a log, that division by two turns into a square root.
* So you get this funny-looking square root: the adjusted score is
roughly MI(u,w) + log2 sqrt( P(u) P(w) ).
* You won't find this in textbooks, but it's completely natural.
I think it's interesting.
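
A minimal Python sketch contrasting cosine similarity with an MI-style
similarity computed from word-disjunct probabilities, including the
square-root ranking term just described. The exact normalization conventions
used in the actual project are simplified here, and the toy probabilities are
made up:

    import math

    # toy joint probabilities P(w, d) over a few disjuncts
    P = {
        ("throw", "S- & O+"): 0.20, ("catch", "S- & O+"): 0.15,
        ("throw", "S- & O+ & MV+"): 0.05, ("ball", "D- & O-"): 0.30,
    }

    def p_word(w):
        return sum(v for (u, d), v in P.items() if u == w)

    def dot(u, w):
        ds = {d for (x, d) in P if x in (u, w)}
        return sum(P.get((u, d), 0.0) * P.get((w, d), 0.0) for d in ds)

    def cosine(u, w):
        return dot(u, w) / math.sqrt(dot(u, u) * dot(w, w))

    def mi_sim(u, w):
        # MI-like similarity: log2 of the vector overlap, normalized by the marginals
        return math.log2(dot(u, w) / (p_word(u) * p_word(w)))

    def ranked_mi(u, w):
        # demote rare words: add log2 sqrt( P(u) P(w) )
        return mi_sim(u, w) + math.log2(math.sqrt(p_word(u) * p_word(w)))

    print(cosine("throw", "catch"), mi_sim("throw", "catch"), ranked_mi("throw", "catch"))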
Factorization
* When I call two different words "a noun", I am assigning them to a
"grammatical class".
* All words in that grammatical class behave similarly.
* What happens when I do this?
* I am going from a specific matrix of (word-disjunct) pairs to
a matrix of (grammatical class, syntactic structure) pairs.
* This is matrix factorization.
* I'm factorizing the matrix P into left, center and right parts.
* The left and right parts are high-dimensional and sparse.
* The central part is low-dimensional, dense, and highly connected.
* This is the de facto organization of all of the existing Link Grammar
dictionaries.
* The people who wrote them were not thinking of matrix factorization,
but it is, de facto, exactly what they've done.
* Tediously, and by hand, but that is what it is.
* So I'm not just making this up or blowing hot smoke.
* This is what linguists actually do, when they write a lexis.
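
A minimal numpy sketch of the generalization-as-factorization step: group
words whose disjunct-count rows look alike into a class, and sum the rows to
get the small, dense (class, disjunct) factor. A real run would use the
MI-based similarity above on a far larger, sparser matrix; the counts, the
0.9 threshold, and the greedy grouping are all illustrative:

    import numpy as np

    words = ["throw", "catch", "see", "ball", "rock"]
    disjuncts = ["S- & O+", "S- & O+ & MV+", "D- & O-", "D- & S+"]
    N = np.array([
        [40,  5,  0,  0],    # throw
        [35,  4,  0,  0],    # catch
        [20, 10,  0,  0],    # see
        [ 0,  0, 30, 25],    # ball
        [ 0,  0, 28, 20],    # rock
    ], dtype=float)

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # greedy agglomeration: put each word into the first class it resembles
    classes = []                     # each class is a list of word indices
    for i in range(len(words)):
        for cls in classes:
            if cos(N[i], N[cls[0]]) > 0.9:
                cls.append(i)
                break
        else:
            classes.append([i])

    # the dense central factor: class-by-disjunct counts
    C = np.array([N[cls].sum(axis=0) for cls in classes])
    for cls, row in zip(classes, C):
        print([words[i] for i in cls], dict(zip(disjuncts, row)))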
Key insight
* Neural nets can accurately capture the dense, interconnected central
region. That's why they work.
* But they necessarily perform dimensional reduction on
the sparse left and right factors.
* By erasing the sparse factors, neural nets become no longer
interpretable.
* There's magic hidden in those neural net weights.
* The goal of the algorithm I've described here is to do what
those neural nets do, but to do it in an interpretable way.
Boom. That's it. I'm done. Let's pause to reflect.
* What I've shown you is an algorithm for extracting structure
from observations of a corpus of data.
* It starts by noting pair-wise correlations.
* These can be used to form Maximum Spanning Tree parses.
* These are broken up into Jigsaw pieces, which are counted.
* These are organized into piles by similarity.
* The result is a grammar that describes the corpus.
* The resulting grammar is a conventional, ordinary linguistic-theory
grammar.
So what?
Let me resume with some trivialities.
Compositionality
* Assembled portions of a jigsaw puzzle are just like a single jigsaw
puzzle piece: an area with a bunch of unconnected connectors.
* On the left is an idiom or institutional phrase.
* It's a couple of connected words, but there are still some unconnected
connectors.
* On the right is an example of anaphora resolution.
* We have a grammar for the words in the sentences,
How about connections between sentences?
* Repeat the process. Move up the hierarchy.
* This is hardly a new idea, people have talked about this
for many decades.
* What's new here:
* I've presented you with an algorithm and a theoretical framework
for which software can be written (is being written).
Let me move on. We're near the end here.
Something from Nothing
* We got lucky with the words
* We need to find processors that convert sensory input into
something like "words".
* Some kind of block that we can apply the algo to.
* How?
* The proposal: generate random processing pipelines, and ... try them
out.
* On the left is some made-up, random audio pipeline: a low-pass filter
and a chirp filter.
* On the right is some other random pipeline. Literally random.
* Apply them to audio data.
* Is there correlation?
* A little, or a lot?
* If these filters have some reasonable MI between them, then they
are recognizing actual patterns actually present in the input.
* If not, throw them away, and try again.
* The claim is -- these will eventually hill-climb their way to
meaningful filter sequences.
* This is kind of like a classical optimization problem.
* The difference is that the "objective function" is a good
old-fashioned maximum entropy principle.
* Again, this might seem like an old idea, or a naive idea.
Nothing particularly new.
* What is new is that we can couple this to the grammar learning algo,
and walk up the hierarchy of abstraction.
* This is the part that has never been done before.
* The hierarchy allows for feedback from the high-level layers,
back down and back up.
* Note also, everything is symbolic, and everything is interpretable.
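
A minimal Python sketch of the proposal just described: generate small random
processing pipelines, run pairs of them over the same input, and keep the
pairs whose outputs share high mutual information. The filters, the binned MI
estimate, and all the numbers are illustrative, not the actual project code:

    import random
    import numpy as np

    def random_pipeline(rng):
        """A random one-stage filter: a moving average or a difference filter."""
        if rng.random() < 0.5:
            k = rng.choice([3, 5, 9])
            return lambda x: np.convolve(x, np.ones(k) / k, mode="same")   # low-pass-ish
        return lambda x: np.diff(x, prepend=x[0])                          # high-pass-ish

    def binned_mi(a, b, bins=8):
        """Crude MI estimate between two signals, via 2-D histogram binning."""
        joint, _, _ = np.histogram2d(a, b, bins=bins)
        p = joint / joint.sum()
        px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
        nz = p > 0
        return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

    rng = random.Random(1)
    t = np.linspace(0, 1, 2000)
    audio = np.sin(2 * np.pi * 50 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)

    candidates = [(random_pipeline(rng), random_pipeline(rng)) for _ in range(10)]
    scored = [(binned_mi(f(audio), g(audio)), f, g) for f, g in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    print("best pipeline pair shares", round(scored[0][0], 2), "bits")   # keep these, try more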
Symbol grounding problem
* This is how we get to interpretability.
* Any engineer can look at that filter tree, and say "oh I know exactly
what it does. OMG, it's even labelled: it's a low-pass filter, and a
chirp filter. So obvious."
* What is a "whistle"? Oh, it is a specific filter sequence.
* What does the word "whistle" mean?
* "I know a whistle when I hear it" but how can I explain it?
* Grammar mining has discovered what the word means. It means
this particular sound, described by this filter sequence.
* The frame problem asks: what parts of the environment are
important for making decisions?
* Answer: those parts of the environment that are picked out by the
currently active filters.
* But what are those?
* Answer: they were learned, by applying mutual information to
extract similarity, and then factorize/generalize into common
situations we find ourselves in.
Common Sense Reasoning
* Olde-school AI work on reasoning harks back to the foundations of
mathematical logic developed in the first half of the 20th century.
* The trope is that common sense is somehow like mathematical logic.
* "The rational actor" of the enlightenment, of economics.
(Its not. That's not what common sense is.)
* Doctor, Doctor, it hurts when I do this. Well, don't do that.
* Never explain a joke.
* But this is important.
* This is funny precisely because the logic is correct, but it
fails the meta problem of what doctors do, of why we went to
the doctor in the first place.
* The common-sense reasoning context is that this is taking place in a
doctor's office, which drags in the frame (the frame problem) of
everything that doctors do, which includes curing ailments.
* By the way,
(Frames are like the lexical functions of MTT. They tell you how to
navigate the network of relations.)
* If we can learn lexical functions, then we can learn common-sense
behavior patterns, and common-sense reasoning.
* We can learn the network of what happens in a doctor's office, and
how to navigate that network. (The same way we navigate lexical
functions.)
* We can determine that the advice "don't do that" fails to match the
"rational" expectations of what happens in a doctor's office.
* But what are the "rational expectations" of what happens in a doctor's
office? Oh, well, it is that learned network.
* That's the joke, explained.
* We can learn the rules of reasoning; they are not God-given (aka
hard-coded by some programmer.)
* They can be learned, and I've described an algorithm for learning
them.
Conclusions
* I skipped over experimental results. There are a lot of them.
* I've been working on this project for a long time.
* The experimental results ... well,
* They come out of the blue -- like the Gaussian distribution
I showed earlier.
* I have no clue why. Nor do any search engines.
* Most of what I've talked about seems simple and obvious enough,
yet surprisingly there is very little published work that is
relevant.
* It's a grand, unknown, unexplored territory. What I've glimpsed
so far looks great, but who knows -- it's an exploration.
* It's an opportunity for any student who wants to make a mark on
the AGI world.
----
Thank you. Questions?
=======================================================
----
Is this a Gaussian unitary ensemble?
Extra material:
The definition of syntax here is the same as the conventional definition
of syntax.