Transcript of what I meant to say in presenting the slides here:
recognizing-patterns.pdf AKA
https://github.com/opencog/learn/blob/master/learn-lang-diary/recognizing-patterns.pdf
A recording of the presentation is here:
https://www.youtube.com/watch?v=PLSjqHtyw3E
Title page:
* Hi, my name is Linas Vepstas; you can call me Linas.
* I want to tell you about a theory of grammar that works not only
for natural language, but also, I believe, for audio, video, or
any sensory input. And not just low-level stuff, but also for high-level
functions, such as common-sense reasoning and natural deduction.
* not only that, but it is possible to write algorithms that learn
this grammar in an unsupervised setting: without any prior
human-selected inputs.
* As a bonus, the theory is symbolic, not sub-symbolic, and as such,
it provides for explainable, comprehensible patterns, thus offering
an answer to the question of understanding how it thinks.
XXX alt text
* I'm here to talk about the automated, unsupervised learning
of explainable, comprehensible patterns in language,
and in a general setting that includes audio and video,
or anything -- the noosphere.
Slide 1:
* I'm going to start by making some outrageous claims.
* I'm going to claim that the system being presented here is capable of
solving the famous Frame Problem and the Symbol Grounding Problem.
* That common-sense reasoning can be learned.
* Not only are these claims the holy grail of AGI, which is why they are
outrageous,
* But I will present an algorithm that is not particularly complicated.
* It consists of some fairly well-known parts that many others have
fooled with over the decades.
* It's not in uncharted territory.
* In this sense, most of what I will say has been said before.
* It might even seem obvious, or simplistic to you, or even naive.
* I'd like to ask for your attention nonetheless, as you may find
the particular combination here surprising and unusual.
* Bear with me. I'll start out simple.
Everything in the universe is a graph
* This is the universe not just of physical stars and planets,
but also the universe of thoughts and concepts, the noosphere.
* Sparse graphs are necessarily labeled, as otherwise there is no
way of talking about the things in the graph.
* The edges are necessarily labelled by the vertices at their
end-points.
Graphs are decomposable
* This should be obvious, but is rarely mentioned in graph theory.
* You can cut an edge to get two half-edges.
* After cutting the edge, you need to label the cut ends so that
you know how to join them again.
* The metaphor I will use repeatedly is that of the jigsaw puzzle-piece
connector.
* Jigsaw puzzle pieces are designed so that the mating
of the pieces is forced.
* When you cut an edge, attach a connector to it.
* In this example, polarity is indeterminate; flip a coin.
* In the grand scheme of things, you will discover that connectors
can be multi-polar, multi-sexual.
* This is perhaps politically controversial, but biologists have found
fungi with 43 different types of sex.
* Some of these can mate, and produce viable offspring, and some cannot.
It's quite strange.
* I've drawn a picture of a telescope jigsaw-puzzle piece, with
connectors for looking at the sun and the moon.
* There are other things that can see the sun and the moon: eyes and
lenses. They have the same jigsaw shape.
* These shapes define the syntax of the graph.
* The syntax is exactly that collection of shapes that allow connections
to occur.
* This might not be the definition of syntax that you are comfortable
with, or that you have learned.
* It is, however, entirely equivalent to the more conventional
definitions you may know.
* This is not a particularly deep claim; it's fairly well-known and
well-understood. I'm only sorry that I don't have a good reference for it
at my fingertips. (A small code sketch of the connector idea follows below.)
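
To make the half-edge idea concrete, here is a minimal Python sketch. It is
not the project's actual code; the names (Connector, Piece, the "LOOKS-AT"
type) are made up for illustration. The point is only that a cut edge leaves
two labelled connectors, and the mating rule is forced by the labels:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Connector:
        ctype: str      # the connector's type label, e.g. "S", "O", "LOOKS-AT"
        polarity: str   # "+" points one way along the cut edge, "-" the other

        def mates(self, other: "Connector") -> bool:
            # Mating is forced: same type label, opposite polarity.
            return self.ctype == other.ctype and self.polarity != other.polarity

    @dataclass(frozen=True)
    class Piece:
        label: str
        connectors: tuple

    telescope = Piece("telescope", (Connector("LOOKS-AT", "+"),))
    moon = Piece("moon", (Connector("LOOKS-AT", "-"),))
    print(telescope.connectors[0].mates(moon.connectors[0]))   # True: they can join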
Graphs are Compositional
* This is so basic, so obvious, that we are like fish in water:
we are not aware of the water.
That's why I have to say these things out loud.
* At the top of this chart, I've drawn a conventional definition of a
term algebra.
* A term algebra consists of a set of function symbols taking arguments.
The arguments can be constants, variables, or other terms.
* Plugging things together has a name: it's called "beta reduction".
Terms are always DAGs, directed acyclic graphs.
* People in symbolic AI and proof theory and programming and compilers
and chip design work with term algebras all the time. It's the stock
in trade.
* The only point here is that jigsaw connectors generalize the
concepts of term algebras. One can assemble things that aren't just
DAGs, but any directed or undirected graph, in general.
* The connectors are type theoretical types.
* Types have structure; types can have a very complex structure.
* Thus, perhaps the jigsaw diagram is misleading: that connector
looks so simple.
* It can, in fact, be quite complex.
* By the way: a type-theory type is the same thing as a type in computer
science.
* It can be an int, a float, a string: these are primitive types.
* It can be a compound type: a list, an OO class, a function signature.
* I'm drawing a jigsaw connector to represent any and all of these things.
* I warned you that this will all be simple, yet confusing at the same
time. Perhaps it begins here.
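
Here is a minimal Python sketch of terms-as-trees and of "plugging"
(beta reduction). All names are illustrative, not drawn from any
particular library:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Var:
        name: str

    @dataclass(frozen=True)
    class Term:
        head: str          # the function symbol
        args: tuple = ()   # constants, variables, or other terms

    def plug(expr, var, value):
        """Substitute `value` for every occurrence of the variable `var`."""
        if isinstance(expr, Var):
            return value if expr.name == var else expr
        return Term(expr.head, tuple(plug(a, var, value) for a in expr.args))

    # f(x, g(x, c)) with x := h(c) becomes f(h(c), g(h(c), c))
    t = Term("f", (Var("x"), Term("g", (Var("x"), Term("c")))))
    print(plug(t, "x", Term("h", (Term("c"),))))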
Slide physics:
* In the next two slides, I simply want to indicate that these ideas are
just absolutely everywhere.
* All over the place.
* Here's a quick tour.
* Anything tensorial.
* On the left: cobordisms, from string theory.
* On the right: natural language as a quantum algebra.
* The actual paper there is quite interesting.
* Coecke is saying that you can parse natural language quickly and
easily with quantum algorithms.
* Anyway: Composition is like contracting tensor indexes.
* Connecting connectors is like contracting tensor indexes.
Chemistry Biology
* Some examples from chemistry and biology.
* A chemical reaction has inputs and outputs.
* These are the "connectors", they must connect, for a reaction to
happen.
* Prusinkiewicz has been developing algorithmic botany for the last
three decades. It is beautiful and stunning and relevant to AGI.
You should check it out.
Link Grammar:
* My fetish for jigsaw connectors comes from Link Grammar.
* The diagram here is taken from the original 1991 LG paper.
That paper explicitly talks about jigsaw-puzzle pieces.
* Above, you can see some words with connectors.
* Below, a parse:
S for the subject-verb linkage
O for the object-verb linkage
D for connecting to the determiner, which English demands.
* The diagram drawn here is correct for this sentence, but
as a warning:
* I'm glossing over some more complex ideas in the connector
type matching system. As I said earlier, types need not be primitive
types; they can be compound types.
* A few quick side remarks: the LG grammar can be converted into HPSG,
into dependency grammars, into categorial grammars, functional
grammars.
* This can be done algorithmically.
* There are some eye-watering mathematical proofs of this stuff out there.
* If you haven't heard of LG before ... well, now you have. It's real.
(A tiny connector-matching sketch, in code, follows at the end of this section.)
* There are complete dictionaries for English and Russian.
* There are smaller demo dictionaries for ten more.
* It's legit, linguistically-speaking; hundreds of published papers.
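
To ground the S/O/D example above, here is a tiny hand-rolled Python sketch
of the connector-matching idea. It does not use the actual link-grammar
library or its dictionaries; the mini-lexicon and the "S+"/"O-" notation are
simplified for illustration:

    # Each word carries a disjunct: a list of typed connectors pointing
    # left ("-") or right ("+"). A link joins a "+" connector on the earlier
    # word to a "-" connector of the same type on a later word.
    lexicon = {
        "she":   ["S+"],          # wants to be a subject, linking rightward
        "threw": ["S-", "O+"],    # wants a subject on its left, an object on its right
        "the":   ["D+"],          # a determiner, linking rightward
        "ball":  ["D-", "O-"],    # wants a determiner, and wants to be an object
    }

    def can_link(left_conn: str, right_conn: str) -> bool:
        # e.g. "O+" on the earlier word mates with "O-" on the later word
        return (left_conn[:-1] == right_conn[:-1]
                and left_conn.endswith("+") and right_conn.endswith("-"))

    print(can_link(lexicon["threw"][1], lexicon["ball"][1]))  # True: O+ of "threw" mates O- of "ball"
    print(can_link(lexicon["she"][0], lexicon["ball"][1]))    # False: S+ does not mate O-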
Vision:
* Now for the good stuff.
* The jigsaw paradigm can be applied to the analysis of shapes and
colors.
* I've drawn the abstract graph for the corresponding image.
* There are connectors.
* Some connectors connect north-south directions
* Some connectors connect to colors
* Some connectors connect to round shapes.
* Some connectors connect to a background.
* So it's not just shape, or just color: some of these connections
are abstract.
* A key point: it is not about pixels. (!!)
* Shape grammars really are different from pixel collections.
Audio:
* We can do this for sound, too.
* This is a whale song from NOAA.
* It's a fast-Fourier-transform (FFT) spectrum.
* On the bottom left, you can see a sequence of seven regularly spaced
low frequency chirps.
* You know what a chirp is. Tweet.
* A rising or a falling pitch.
* The diagram on the right is a structural representation of the
first 10 seconds of the FFT spectrum.
* I've tried to draw the diagram that captures and extracts that
structure.
* AH HAH!
* But now we come to a key point.
* Where did those meaningful filters come from?
* How did we know to look for chirps?
* Surely it is not because some grad student hand-coded up a bunch of
filters just so.
* I mean, you could do that, but that is not the point of AGI.
* You want to discover these filters from scratch. From nothing at all.
* So how do you do that?
* How do you find structure from nothing?
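
Here is a minimal numpy sketch of the kind of front-end being described: a
short-time FFT over an audio signal, plus a crude check for a rising chirp.
In the talk, the whole point is that such filters should eventually be
discovered rather than hand-coded; this only shows the raw material they
operate on. All numbers are made up:

    import numpy as np

    fs = 4000                                            # sample rate, Hz
    t = np.arange(0, 2.0, 1.0 / fs)
    signal = np.sin(2 * np.pi * (50 * t + 40 * t ** 2))  # a rising chirp, 50 Hz and up

    def stft(x, nfft=256, hop=128):
        """Short-time FFT magnitude: rows are frequency bins, columns are time frames."""
        frames = [x[i:i + nfft] * np.hanning(nfft)
                  for i in range(0, len(x) - nfft, hop)]
        return np.abs(np.fft.rfft(np.array(frames), axis=1)).T

    spec = stft(signal)
    peaks = spec.argmax(axis=0)                          # strongest frequency bin per frame
    print("peak bin drifts from", peaks[0], "to", peaks[-1])   # it rises, for a rising chirp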
Part Two
* Now we come to the dense, mathematical part of this talk.
* I will go fairly quickly, because many of these ideas are
well-known in the machine-learning industry.
* I will show how to build up a graph structure from pair-wise
correlations.
* Then I will show how to generalize from the particulars of
specific graphs.
* This generalization turns out to be isomorphic to clustering,
and to matrix factorization.
* You can use either clustering algorithms to do this, or matrix
factorization algorithms to do this.
* I bet you can even use neural nets to do this.
* No seriously! You can! No one has done it, but you can!
* There's gonna be a lot of neural-net fans at this conference;
I can't steal their thunder.
So, the details.
Lexical attraction
* Start with a frequentist approach to probability.
* Observe things, and count how often they occur.
* Observing pairs is the easiest.
* Blast through some large corpus of English text.
* Say, a million sentences.
* Count how often you see word pairs: words close to one-another.
* 2-grams is the fancy word for it.
* This gives you a count, call it N(u,w) of the word on the left and the
word on the right.
* Divide by the total number of pairs to get a probability,
a frequentist probability.
* The star here, the asterisk, is a wild-card.
* Divide the pair probability by the marginal probability of the left
word, and the marginal probability of the right word.
* Take the logarithm
* This is basic textbook MI
* I'll call it Lexical attraction because that's what Yuret called it
and because I need to distinguish from another MI, coming up.
* Otherwise, it's the same thing.
* The diagram here is from Yuret's thesis.
* What he did here is very clever:
He drew the maximum spanning tree.
* It is a tree that connects every word in the sentence,
and the edges are chosen so that the grand total MI is the largest
possible.
* Note that the LA/MI for "Northern Ireland" is huge. That's because
when these two words occur in a text, they commonly occur together.
* You don't much see "Northern" with that many other things, and
you don't see a lot of other kinds of Ireland.
* These two words occur together. That is what a high MI score tells
you.
* You can extract a grammar from this.
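
A minimal Python sketch of the counting step just described, under the
frequentist reading: count adjacent word pairs in a toy corpus and compute
MI(u,w) = log2[ P(u,w) / ( P(u,*) P(*,w) ) ]. In practice this runs over
millions of sentences with a wider window; the three-sentence corpus here is
only for illustration:

    import math
    from collections import Counter

    sentences = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "northern ireland is in the news",
    ]

    pair_counts, left_counts, right_counts = Counter(), Counter(), Counter()
    for s in sentences:
        words = s.split()
        for u, w in zip(words, words[1:]):     # adjacent pairs; a small window also works
            pair_counts[(u, w)] += 1
            left_counts[u] += 1
            right_counts[w] += 1

    total = sum(pair_counts.values())

    def lexical_attraction(u, w):
        """MI(u,w) = log2 [ P(u,w) / ( P(u,*) P(*,w) ) ]"""
        p_uw = pair_counts[(u, w)] / total
        return math.log2(p_uw / ((left_counts[u] / total) * (right_counts[w] / total)))

    print(lexical_attraction("northern", "ireland"))   # high: they only occur together
    print(lexical_attraction("the", "cat"))            # lower: "the" pairs with many things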
Lexical entries
* Here's how.
* You've drawn the maximum spanning tree.
* Now cut the edges, and form the labelled connectors.
* Here are several different notations.
* The graphical one, of course.
* A text version.
* The minus signs mean "connect to the left"
* The text version uses an ampersand to say you must have both
connectors.
* It turns out this ampersand is the conjunction from a fragment of
linear logic.
* It's not just some naive boolean-and; it's a bit more subtle.
Interesting.
* The next row is tensorial notation.
* Very popular among physicists and those with a formal mathematics
background.
* Tensorial notation is popular in quantum-influenced thinking;
it's very comfortable for them. A known thing.
* These notations are the same.
* Jigsaw puzzle connectors are tensorial, that's the message.
* Oh, note that the whole thing looks like a skip-gram.
If you're from the neural net world, you know skip-grams.
This is like that, but different. A hah!
* I write d for disjunct because the connector sequences are disjoined.
* By that I mean, you have a menu choice:
you can throw a ball, catch a ball, see a ball.
* But you can only use one of these words at a time in a sentence.
Choose one: choose an item from a menu.
* It's a disjunction.
* Now we do our frequentist thing again.
* Run through your corpus, a million sentences, and count
how often you see a word-disjunct pair.
* You get a count N(w,d)
* Turn the count into a probability.
* Given a word, write down the counts for that word.
* The result is something that looks like a vector.
* I've written it as a vector here.
* There's something a bit clever here: the plus sign can be interpreted
as literal, mathematical addition.
* But also as disjunction: the plus sign is a menu choice.
* To build a sentence you must pick one.
P(w,d) tells you how often to pick that choice.
Shades of wave-function collapse! OMG!
* hat-e is a unit basis vector.
* OK you've got a bunch of these. Now what?
* Generalize from the specific to the general.
(Skip a slide) I've got my slides in the wrong order.
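
A minimal Python sketch of the disjunct-counting step described above: given
a spanning-tree parse (hand-written here; in the real pipeline it comes from
the MST step), cut each edge, record for every word the connectors left
behind, and count the (word, disjunct) pairs N(w,d). Connector labels and
ordering conventions are simplified for illustration:

    from collections import Counter

    words = ["she", "threw", "the", "ball"]
    edges = [(0, 1), (1, 3), (2, 3)]   # spanning-tree links, as (left position, right position)

    disjunct_counts = Counter()
    for i, w in enumerate(words):
        connectors = []
        for a, b in edges:
            if a == i:
                connectors.append(words[b] + "+")   # link to a word on the right
            elif b == i:
                connectors.append(words[a] + "-")   # link to a word on the left
        disjunct_counts[(w, " & ".join(connectors))] += 1

    for (w, d), n in disjunct_counts.items():
        print(f"N({w!r}, {d!r}) = {n}")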
Learning a Lexis
* From "throw the ball" I want to generalize to "verb determiner noun"
How?
* By similarity.
(Back a slide)
I need a similarity score.
Similarity scores
* Cosine distance is commonly used in machine learning.
* It's not that good.
* The problem is that cosine distance was designed to be invariant in
Euclidean space. It's ideal for that. It's a Casimir invariant.
* The problem with assuming Euclidean space is that taking orthogonal
projections can lead to negative probabilities. It's pathological.
You can't just take the orthogonal complement; you get nonsense.
* You need something that works in probability space.
* Probability space is not Euclidean; it's a simplex:
a triangle, a tetrahedron in 3D, a something-gon in 4D...
* Anyway, experimental measurements show that cosine distance is low quality.
* MI is well-behaved under Markov (affine) transformations.
* As before, star means sum over everything
* Think of the marginals here as a weighted word probability.
* There's even a Feynman diagram for this: the marginals
are the vacuum contributions! So you can get quite fancy with the
theory, here, if you want.
* Now a brief diversion. I graphed the distribution of MI.
This graph is for 2/3rds of a million word pairs.
* It's on a log scale. It's a parabola.
* A parabola on a log scale is just a bell curve. A Gaussian. A normal
distribution.
* Why the heck would word pairs, from the English language, when pumped
through a grammar detector built from MST parses, form
a normal distribution?
* Really? Why? Who knows. I don't think anyone knows.
This gives a hint of how this entire topic is lacking in theoretical
foundations.
* I'd love to know why. My pet theory is that it's a GUE, but I can't yet
show this.
Anyway...
* So we are learning to generalize.
* The biggest problem with MI similarity is that very rare words often
have a very large MI.
* It is a poor strategy (I think!?) to start assigning rare words
to grammatical classes.
* You want to start with common words. But how?
* Well, here's a proposal: take the average probability of the two
words, take the log, and add that to the MI.
* Now, averaging means dividing by two.
* Inside a log, that division by two turns into a square root.
* So you get this funny-looking square root: the adjusted score is
roughly MI(u,w) + log2 sqrt( P(u) P(w) ).
* You won't find this in textbooks, but it's completely natural.
I think it's interesting.
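
A minimal Python sketch contrasting cosine similarity with an MI-style
similarity computed from word-disjunct probabilities, including the
square-root ranking term just described. The exact normalization conventions
used in the actual project are simplified here, and the toy probabilities are
made up:

    import math

    # toy joint probabilities P(w, d) over a few disjuncts
    P = {
        ("throw", "S- & O+"): 0.20, ("catch", "S- & O+"): 0.15,
        ("throw", "S- & O+ & MV+"): 0.05, ("ball", "D- & O-"): 0.30,
    }

    def p_word(w):
        return sum(v for (u, d), v in P.items() if u == w)

    def dot(u, w):
        ds = {d for (x, d) in P if x in (u, w)}
        return sum(P.get((u, d), 0.0) * P.get((w, d), 0.0) for d in ds)

    def cosine(u, w):
        return dot(u, w) / math.sqrt(dot(u, u) * dot(w, w))

    def mi_sim(u, w):
        # MI-like similarity: log2 of the vector overlap, normalized by the marginals
        return math.log2(dot(u, w) / (p_word(u) * p_word(w)))

    def ranked_mi(u, w):
        # demote rare words: add log2 sqrt( P(u) P(w) )
        return mi_sim(u, w) + math.log2(math.sqrt(p_word(u) * p_word(w)))

    print(cosine("throw", "catch"), mi_sim("throw", "catch"), ranked_mi("throw", "catch"))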
Factorization
* When I call two different words "a noun", I am assigning them to a
"grammatical class".
* All words in that grammatical class behave similarly.
* What happens when I do this?
* I am going from a specific matrix of (word-disjunct) pairs to
a matrix of (grammatical class, syntactic structure) pairs.
* This is matrix factorization.
* I'm factorizing the matrix P into left, center and right parts.
* The left and right parts are high-dimensional and sparse.
* The central part is low-dimensional, dense, and highly connected.
* This is the de facto organization of all of the existing Link Grammar
dictionaries.
* The people who wrote them were not thinking of matrix factorization,
but it is, de facto, exactly what they've done.
* Tediously, and by hand, but that is what it is.
* So I'm not just making this up or blowing hot smoke.
* This is what linguists actually do, when they write a lexis.
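
A minimal numpy sketch of the generalization-as-factorization step: group
words whose disjunct-count rows look alike into a class, and sum the rows to
get the small, dense (class, disjunct) factor. A real run would use the
MI-based similarity above on a far larger, sparser matrix; the counts, the
0.9 threshold, and the greedy grouping are all illustrative:

    import numpy as np

    words = ["throw", "catch", "see", "ball", "rock"]
    disjuncts = ["S- & O+", "S- & O+ & MV+", "D- & O-", "D- & S+"]
    N = np.array([
        [40,  5,  0,  0],    # throw
        [35,  4,  0,  0],    # catch
        [20, 10,  0,  0],    # see
        [ 0,  0, 30, 25],    # ball
        [ 0,  0, 28, 20],    # rock
    ], dtype=float)

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # greedy agglomeration: put each word into the first class it resembles
    classes = []                     # each class is a list of word indices
    for i in range(len(words)):
        for cls in classes:
            if cos(N[i], N[cls[0]]) > 0.9:
                cls.append(i)
                break
        else:
            classes.append([i])

    # the dense central factor: class-by-disjunct counts
    C = np.array([N[cls].sum(axis=0) for cls in classes])
    for cls, row in zip(classes, C):
        print([words[i] for i in cls], dict(zip(disjuncts, row)))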
Key insight
* Neural nets can accurately capture the dense, interconnected central
region. That's why they work.
* But they necessarily perform dimensional reduction on
the sparse left and right factors.
* By erasing the sparse factors, neural nets become no longer
interpretable.
* There's magic hidden in those neural net weights.
* The goal of the algorithm I've described here is to do what
those neural nets do, but to do it in an interpretable way.
Boom. That's it. I'm done. Let's pause to reflect.
* What I've shown you is an algorithm for extracting structure
from observations of a corpus of data.
* It starts by noting pair-wise correlations.
* These can be used to form Maximum Spanning Tree parses.
* These are broken up into Jigsaw pieces, which are counted.
* These are organized into piles by similarity.
* The result is a grammar that describes the corpus.
* The resulting grammar is a conventional, ordinary linguistic-theory
grammar.
So what?
Let me resume with some trivialities.
Compositionality
* Assembled portions of a jigsaw puzzle are just like a single jigsaw
puzzle piece: an area with a bunch of unconnected connectors.
* On the left is an idiom or institutional phrase.
* It's a couple of connected words, but there are still some unconnected
connectors.
* On the right is an example of anaphora resolution.
* We have a grammar for the words in the sentences,
How about connections between sentences?
* Repeat the process. Move up the hierarchy.
* This is hardly a new idea, people have talked about this
for many decades.
* What's new here:
* I've presented you with an algorithm and a theoretical framework
for which software can be written (is being written).
Let me move on. We're near the end here.
Something from Nothing
* We got lucky with the words
* We need to find processors that convert sensory input into
something like "words".
* Some kind of block that we can apply the algo to.
* How?
* The proposal: generate random processing pipelines, and ... try them
out.
* On the left is some made-up, random audio pipeline: a low-pass filter
and a chirp filter.
* On the right is some other random pipeline. Literally random.
* Apply them to audio data.
* Is there correlation?
* A little, or a lot?
* If these filters have some reasonable MI between them, then they
are recognizing actual patterns actually present in the input.
* If not, throw them away, and try again.
* The claim is -- these will eventually hill-climb their way to
meaningful filter sequences.
* This is kind of like a classical optimization problem.
* The difference is that the "objective function" is a good
old-fashioned maximum entropy principle.
* Again, this might seem like an old idea, or a naive idea.
Nothing particularly new.
* What is new is that we can couple this to the grammar learning algo,
and walk up the hierarchy of abstraction.
* This is the part that has never been done before.
* The hierarchy allows for feedback from the high-level layers,
back down and back up.
* Note also, everything is symbolic, and everything is interpretable.
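
A minimal Python sketch of the proposal just described: generate small random
processing pipelines, run pairs of them over the same input, and keep the
pairs whose outputs share high mutual information. The filters, the binned MI
estimate, and all the numbers are illustrative, not the actual project code:

    import random
    import numpy as np

    def random_pipeline(rng):
        """A random one-stage filter: a moving average or a difference filter."""
        if rng.random() < 0.5:
            k = rng.choice([3, 5, 9])
            return lambda x: np.convolve(x, np.ones(k) / k, mode="same")   # low-pass-ish
        return lambda x: np.diff(x, prepend=x[0])                          # high-pass-ish

    def binned_mi(a, b, bins=8):
        """Crude MI estimate between two signals, via 2-D histogram binning."""
        joint, _, _ = np.histogram2d(a, b, bins=bins)
        p = joint / joint.sum()
        px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
        nz = p > 0
        return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())

    rng = random.Random(1)
    t = np.linspace(0, 1, 2000)
    audio = np.sin(2 * np.pi * 50 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)

    candidates = [(random_pipeline(rng), random_pipeline(rng)) for _ in range(10)]
    scored = [(binned_mi(f(audio), g(audio)), f, g) for f, g in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    print("best pipeline pair shares", round(scored[0][0], 2), "bits")   # keep these, try more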
Symbol grounding problem
* This is how we get to interpretability.
* Any engineer can look at that filter tree, and say "oh I know exactly
what it does. OMG, it's even labelled: it's a low-pass filter, and a
chirp filter. So obvious."
* What is a "whistle"? Oh, it is a specific filter sequence.
* What does the word "whistle" mean?
* "I know a whistle when I hear it" but how can I explain it?
* Grammar mining has discovered what the word means. It means
this particular sound, described by this filter sequence.
* The frame problem asks: what parts of the environment are
important for making decisions?
* Answer: those parts of the environment that are picked out by the
currently active filters.
* But what are those?
* Answer: they were learned, by applying mutual information to
extract similarity, and then factorize/generalize into common
situations we find ourselves in.
Common Sense Reasoning
* Olde-school AI work on reasoning harks back to the foundations of
mathematical logic developed in the first half of the 20th century.
* The trope is that common sense is somehow like mathematical logic.
* "The rational actor" of the enlightenment, of economics.
(Its not. That's not what common sense is.)
* Doctor, Doctor, it hurts when I do this. Well, don't do that.
* Never explain a joke.
* But this is important.
* This is funny precisely because the logic is correct, but it
fails the meta problem of what doctors do, of why we went to
the doctor in the first place.
* The common-sense reasoning context is that this is taking place in a
doctor's office, which drags in the frame (the frame problem) of
everything that doctors do, which includes curing ailments.
* By the way,
(Frames are like the lexical functions of MTT. They tell you how to
navigate the network of relations.)
* If we can learn lexical functions, then we can learn common-sense
behavior patterns, and common-sense reasoning.
* We can learn the network of what happens in a doctor's office, and
how to navigate that network. (The same way we navigate lexical
functions.)
* We can determine that the advice "don't do that" fails to match the
"rational" expectations of what happens in a doctor's office.
* But what are the "rational expectations" of what happens in a doctor's
office? Oh, well, it is that learned network.
* That's the joke, explained.
* We can learn the rules of reasoning; they are not God-given (aka
hard-coded by some programmer.)
* They can be learned, and I've described an algorithm for learning
them.
Conclusions
* I skipped over experimental results. There are a lot of them.
* I've been working on this project for a long time.
* The experimental results ... well,
* They come out of the blue -- like the Gaussian distribution
I showed earlier.
* I have no clue why. Nor do any search engines.
* Most of what I've talked about seems simple and obvious enough,
yet surprisingly there is very little published work that is
relevant.
* It's a grand, unknown, unexplored territory. What I've glimpsed
so far looks great, but who knows -- it's an exploration.
* It's an opportunity for any student who wants to make a mark on
the AGI world.
----
Thank you. Questions?
=======================================================
----
Is this a Gaussian unitary ensemble?
Extra material:
The definition of syntax here is the same as the conventional definition
of syntax.