
Rethink temporal pooling - discrete transitions at higher levels. #32

Merged · 17 commits · Oct 6, 2015

Conversation

@floybix (Member) commented Sep 29, 2015

This rewrite comes from taking seriously the need for sequence learning at higher levels. At higher levels we have temporal slowness: cells stay active for longer, i.e. several time steps. If there is a uniform cortical algorithm then we need the usual sequence learning to work under temporal slowness. I think that means discrete transitions.

I define a threshold fraction of stable inputs to a layer for it to be engaged. Only when a layer is newly engaged does it replace its active columns based on the stable input. At other times, these active columns and cells continue to stay active. The exception is any columns that have a definite (high) match to input bits; the relative influence of these over continuing columns is controlled by a parameter (temporal-pooling-max-exc).

Active columns learn on proximal dendrites as long as the layer is engaged, meaning it has continuing stable input.
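The engagement and column-selection rules above might be sketched as follows. This is a hypothetical illustration: the names `is_engaged`, `select_active_columns`, and the 0.5 threshold are made up here, not Comportex's actual API.

```python
def is_engaged(stable_fraction, engage_threshold=0.5):
    """A layer is engaged when a threshold fraction of its input is stable.
    (The threshold value here is an arbitrary placeholder.)"""
    return stable_fraction >= engage_threshold

def select_active_columns(prev_active, stable_cols, high_match_cols,
                          newly_engaged):
    """Return the set of active column ids for this time step.

    prev_active     -- columns active at the previous step
    stable_cols     -- columns driven by the stable (predicted) input
    high_match_cols -- columns with a definite, high match to input bits;
                       these may break through even between engagements
    newly_engaged   -- True only on the step the layer becomes engaged
    """
    if newly_engaged:
        # A new engagement replaces the active columns wholesale,
        # based on the stable input.
        return set(stable_cols)
    # Otherwise continuing columns persist; only high-matching columns
    # may join (their influence relative to continuing columns would be
    # capped by temporal-pooling-max-exc, not modelled here).
    return set(prev_active) | set(high_match_cols)
```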

For learning on distal dendrites, define:

  • learning cells - the new winning cells, excluding any continuing.
  • learnable cells - the previous winning cells, excluding any continuing.

Each time step, the learning cells can grow synapses to the learnable cells. This applies in all layers, not just higher temporal pooling layers. Notably, it has a big effect on gradual continuous sequence learning, such as with the coordinate encoder. That will probably cause problems, because there might not be enough coincidences of some cells starting while others are stopping. Maybe the learnable cells should remain learnable for a few time steps.
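The two definitions above amount to set differences against the continuing winners. A minimal sketch (`distal_learning_pairs` is an illustrative name, not a real Comportex function):

```python
def distal_learning_pairs(winners_now, winners_prev):
    """learning  = the new winning cells, excluding any continuing;
    learnable = the previous winning cells, excluding any continuing.
    Returns the (learning, learnable) pairs that may grow a synapse."""
    continuing = winners_now & winners_prev
    learning = winners_now - continuing
    learnable = winners_prev - continuing
    # Each learning cell may grow distal synapses to each learnable cell.
    return [(b, a) for b in sorted(learning) for a in sorted(learnable)]
```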

I'm not sure the old way was much better because it would end up with a lot of cells connecting to the other ones representing the same coordinate, which is not useful sequence information.

Obviously, this is all experimental. I haven't really experimented with it to see how the temporal pooling properties hold up. But what we had didn't work anyway, so might as well replace it.

@floybix (Member Author) commented Sep 29, 2015

With this change we grow distal synapses from source A to target B if their activation lines up in series, like

AAAA
    BBBB

Maybe we should also grow when they overlap but are clearly ordered, like

AAAA
  BBBB

and maybe

AAAA
  BB

This could be implemented by defining learning cells to be newly active winners (as with this change), but allowing source learnable cells to be all winners, not just the ones turning off.

However that would allow this connection to be learned which is questionable:

AAAA
 BB
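The variants above can be checked with a toy timeline model, assuming learning cells are the newly active winners and source cells are learnable if they were winners at the previous step (which covers both "A just stopped" and "A still active"). All names here are made up for illustration:

```python
def parse(line, mark):
    """Turn a timeline string like 'AAAA....' into per-step booleans."""
    return [c == mark for c in line]

def grows_connection(a_steps, b_steps):
    """True if at some step B is newly active while A is learnable,
    i.e. A was active at the previous step."""
    for t in range(1, len(b_steps)):
        b_starts = b_steps[t] and not b_steps[t - 1]
        a_learnable = a_steps[t - 1]
        if b_starts and a_learnable:
            return True
    return False

# Series:    AAAA.... / ....BBBB  -> grows
# Overlap:   AAAA..   / ..BBBB    -> grows
# Contained: AAAA     / .BB.      -> also grows (the questionable case)
# Reversed:  ....AAAA / BBBB....  -> does not grow
```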

@floybix (Member Author) commented Sep 29, 2015

OK, I think we need to handle continuous sequences, so I'm going with what I just wrote: allowing distal learning between cells overlapping in time, by making all winner cells learnable (but winners only learn when they first become active).

Just looking at the coordinate encoder demo, there is a lot of non-sequential learning going on: while a column stays active, the winner cell in the column often switches under the influence of distal (predictive) excitation. Since these are "new" winner cells they do distal learning, which ends up producing a lot of noise.

An obvious solution would be to force the winner cell in a column to stay fixed until the column turns off. However, I don't want to do that because the initial context of a column might be wrong; e.g. top-down feedback should be able to resolve the context to the correct cell in a column even while the column stays active.

I think instead we could allow the winner cell in a column to switch according to total excitation (as now), but if it is in a continuing active column it should not be learning.
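The proposal above, as a minimal sketch (hypothetical names; real winner selection involves far more than a max over excitation):

```python
def winner_and_learning(cells_exc, column_continuing):
    """cells_exc: dict of cell id -> total excitation for one column.
    The winner may switch freely by excitation, but a winner in a
    continuing active column does no distal learning."""
    winner = max(cells_exc, key=cells_exc.get)
    should_learn = not column_continuing
    return winner, should_learn
```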

@cogmission (Member) commented Sep 29, 2015

Hi Felix,

I don't mean to chime in where I'm not invited, but it seems you are doing some very critical thinking and experimentation. Your implementations are far ahead of the curve, and I don't understand why you choose not to do this kind of thinking in the general theoretical forum? It doesn't seem to be only a Comportex-oriented subject? I may be wrong, but it seems like you would get the benefit of a wider knowledgeable audience, and the general public would get the benefit of your excellent work?

Just a thought...

Cheers,
David


@floybix (Member Author) commented Sep 29, 2015

Well, that didn't help much, because the winners kept switching even when not learning. We really don't want to break the connection between learning and learnable cells. But in combination with a change to keep winner cells stable when all else is equal, we are getting somewhere (this was used in higher levels but applies equally to gradual continuous sequences).

In fact, howdidinotrealisethisbefore. In gradually changing continuous sequences we have cells remaining active over time while we are within its coordinate range, say. That shares a lot of properties with temporal pooling at higher levels. Look at this, in the single-layer coordinates-2d demo: (time goes right, columns sorted for clarity)
[screenshot: single-layer coordinates-2d demo, 2015-09-29]

It shows we are now correctly predicting the onset of active cells. The red (bursting) states are just the initially-predicted ones continuing. And maybe we don't want to keep predicting them... just as we don't want to keep predicting the current state in a stable temporal pooling layer.

Hmm. Anyway I'm going to bed now.

@mrcslws (Collaborator) commented Sep 29, 2015

@cogmission Speaking for myself, I think of nupic-theory as "I want Jeff to read this". These side forums are a staging area for legitimized paragraphs :)

@cogmission (Member) commented Sep 29, 2015

@marcus I see. I just didn't know who was reading this, and I don't see any responses (other than yours), which makes me think that this work may not get the benefit of feedback? It seems to me that Numenta is going to eventually focus their attention on stability, and the ground that Felix covers may get wasted unless he simply solves everything first? Others should get the benefit of your and Felix's hard work without having to think through the same repetitive process... Just a thought...


@floybix (Member Author) commented Sep 30, 2015

@cogmission a good point. I will ask for help from the nupic-theory list as you suggest, but I will see if I can consolidate my thoughts a bit first to avoid wasting everyone's time.

@floybix (Member Author) commented Sep 30, 2015

I was confusing myself about gradual continuous sequences. The way the learning actually works is really weird. Because there is some level of stability in columns, winner cells tend to have similar distal inputs between time steps, so stay active; they also continue to learn on distal segments, extending them to reflect the slight changes between steps. At some point the context changes past a threshold and a new winner takes over. So you get these self-organising transitions. It is easier to understand in the interactive demo.

I don't think it's perfect, as there is a lot of redundant learning going on, but it does seem to work quite well, at least visually. Sampling rate is an open question.
(time goes right, columns sorted for clarity)
[screenshot: coordinates demo with all winners learning]

So anyway all my comments in this thread can be ignored and replaced with this for distal learning:

  • learnable cells are always all the column winners;
  • learning cells are all the column winners at the first level, but excluding any continuing winners at higher levels (while temporal pooling).
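The replacement rule reduces to simple set arithmetic. An illustrative sketch, not the actual implementation:

```python
def learning_and_learnable(winners_now, winners_prev, higher_level):
    """learnable: always all the column winners.
    learning: all winners at the first level, but excluding continuing
    winners at higher (temporal pooling) levels."""
    learnable = set(winners_now)
    if higher_level:
        learning = set(winners_now) - set(winners_prev)
    else:
        learning = set(winners_now)
    return learning, learnable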

@floybix (Member Author) commented Oct 1, 2015

There is a problem with this whole approach which is obvious in retrospect. (In fact I now remember that I realised this before, in my first attempt at temporal pooling, but forgot about it.)

Recall that as soon as the first level becomes predictable, the higher (temporal pooling) layer "engages" and fixes its active columns; they then keep growing new dendrites to encompass the following predictable sequence.

The problem is, just because a sequence is recognised as predictable does not mean it is resolved into a unique identity, and of course it cannot in general be resolved uniquely until the whole sequence has been seen. For example, seeing the letters "t,h,e" vs "t,h,r,e,e": the sequence is predicted at "h" but not uniquely. If we freeze the pooled representation at that point, it will be identical for "the" and "three".

One way to go is Numenta's "Union Pooler" approach - I only have a vague and possibly incorrect understanding: throughout a predicted sequence, more and more cells get added to the temporal pooling representation. Therefore the final representation should have some unique component. The nice part is that the union representation should include bits from all steps of a sequence, so you get semantic overlap with similar sequences. I'm not sure how you get this to be stable enough to model higher level sequences.

Another way might be to use an attention-like mechanism to "engage" the temporal pooling once the predictions have been resolved down to a single path.
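Under the stated caveat that this reading of the Union Pooler may be incorrect, the accumulate-while-predicted idea might be sketched as:

```python
def union_pool(step_cells, was_predicted):
    """step_cells: list of per-step active cell sets.
    was_predicted: parallel list of booleans; an unpredicted step
    resets the pool. While the sequence stays predicted, each step's
    cells are unioned in, so the final SDR carries bits from every
    step of the sequence."""
    pooled = set()
    for cells, predicted in zip(step_cells, was_predicted):
        if not predicted:
            pooled = set()  # sequence broken: start pooling afresh
        pooled |= cells
    return pooled
```

For "the" vs "three", the unions over the full sequences would differ even though they agree at the shared prefix, which is the unique-component property described above.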

@cogmission (Member) commented Oct 1, 2015

> For example seeing the letters "t,h,e" vs "t,h,r,e,e". The sequence is predicted at "h" but not uniquely. If we freeze the pooled representation at that point it will be identical for "the" and "three".

Correct me if I'm wrong, but shouldn't the behavior be either a prediction of "the" or a prediction of "three" until the 3rd letter is reached? Isn't that ok? Also, doesn't scope of prediction also come into play? What I mean is, aren't there times when we remember the "gist" of something but mistake or confuse one or more details? Should we be thinking of the pooler as a resolver or a generalizer? When I originally read your mail my first thought was that the HTM doesn't necessarily "resolve" anything. It felt more true to me that the HTM merely predicts and later receives reinforcement in the form of a successful prediction; if we think about the HTM as "resolving" things, then we get into the arena of meta-oversight. My inclination is to think that oversight isn't happening at all, and that the system merely functions like a Monad that doesn't have contextual awareness, but that awareness is emergent from the prediction mechanism?

Also the choice of "the" or "three" seems like it would be dependent on the
previous input (maybe way way back) to provide successful context
recognition?

Just some random thoughts...

Cheers,
David


@floybix (Member Author) commented Oct 3, 2015

So I did a kind of implementation of a union pooler.

@floybix (Member Author) commented Oct 3, 2015

A puzzle that comes up when we think about sequence learning at higher levels:

How do we maintain a "bursting" column state in a higher level layer? (If a transition was not predicted, the newly activated columns should burst, activating all their cells / contexts.) I'm assuming that the same mechanism should apply at all levels.

Under temporal pooling, cells in a column may stay active for several time steps; for sequence learning this could be either a single predicted cell, or many bursting cells. I make this work by setting a level of persistent temporal pooling excitation on all newly active cells.

Apart from keeping multiple predictions open, the other role of bursting is in defining feed-forward outputs from the layer as being "stable" or not, which is used in temporal pooling at still higher levels. This seems to suggest we should define bursting simply by whether all cells in a column are (continuing to be) active. However, that definition can't apply in the first level if we have one cell per column. And it seems to lose the essence of "bursting" in being defined by a (lack of) predictive depolarising potential.

In practice we seem to be left with a composite definition of bursting: by predictive potential on newly-active / first-level steps, and all-cells-per-column during temporal pooling phases.
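The composite definition might be sketched like this (an illustrative predicate, not the actual Comportex code):

```python
def is_bursting(newly_active, had_predictive_potential,
                active_cells_in_col, cells_per_column):
    """Composite bursting definition:
    - newly-active / first-level case: bursting means the activation
      was not predicted (no cell had predictive depolarising potential);
    - temporal-pooling (continuing) case: bursting means all cells in
      the column are active."""
    if newly_active:
        return not had_predictive_potential
    return active_cells_in_col == cells_per_column
```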

@floybix (Member Author) commented Oct 3, 2015

@mrcslws please review. Um sorry about the 13 commits... do you think I should squash them?

@mrcslws (Collaborator) commented Oct 4, 2015

I'm taking some time to study this change and become opinionated about it. I should have something coherent to say tomorrow (Sunday my time). The 13 commits are fine by me. Feel free to merge without me, I can always comment on commits.

@floybix (Member Author) commented Oct 4, 2015

Sure, there's no urgency about it. And thanks.


@@ -434,8 +442,14 @@
(if good? (conj good-ids id) good-ids)
(if (and good? (< exc min-good-exc)) exc min-good-exc)))
@mrcslws (Collaborator):

This isn't specific to this commit. But the winner dominates all other cells if it dominates... a single cell that's above the threshold? Shouldn't we instead just use filter/remove to get rid of the dominated cells case-by-case?

Alternately, if dominance is all-or-none, it should keep track of the second best excitation, not the lowest excitation above the threshold.

I might be looking at this wrong.

@floybix (Member Author):

You're right, we should just use filter/remove.

(let [new-ac (if newly-engaged?
               ac
               (set/difference ac (:active-cells state)))]
  (into tp-exc
@mrcslws (Collaborator):

Trying to piece this together...

  • If the layer is not engaged, then all newly active cells get their tp-exc set to the max. Those that were already active don't, so they decay in each timestep.
  • If the layer is newly engaged, all active cells get their tp-exc set to the max.
  • If the layer is engaged (but not newly), then all newly active cells get their tp-exc set to the max. The others decay in each timestep (because of the commented out true ;(not engaged?))

Am I reading it right? I haven't grokked the change yet, I'm expecting it to all crystallize after I sleep on it.

@floybix (Member Author):

Yes, that's right. But I think it makes more sense if you consider the proximal excitation together with the temporal pooling excitation.

  • if the layer is newly engaged, any existing TP is cleared, columns are selected using all proximal input, and the activated cells get the full TP amount.
  • if the layer is continuing engaged, columns are selected using all proximal input together with the existing TP. Because we select a larger number of columns each step, they will be the existing TP ones plus some new ones. The new ones (cells) get the full TP amount.
  • if the layer is not engaged, no proximal input comes through, with the exception of any columns having a high match. So TP columns will just continue. The idea was to carry forward context (including transition predictions) rather than forgetting it. But I just realised that it resets the activation level so it won't carry forward unchanged, d'oh.

A thought, inspired by this, "Novel input should appear stable": when a layer is newly not engaged, i.e. something novel appears, we could reset TP and start pooling again! That way we build up a representation of the novel thing, until we start to recognise something and reset (transition) again. But if not engaged then we don't learn proximally.

Whether TP amounts should always decay, or only when the layer is not engaged, I'm not sure. The former would bias the representation towards more recent steps in the pooled sequence. Since we allocate more columns for each step this seems like it should not be necessary, but it would be once we fill up to the maximum density/sparsity.
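Felix's three cases above, folded into one hypothetical update function. The max amount of 50 and decay of 5 are placeholders loosely taken from the numbers mentioned in this thread, and decay is applied every step here even though the thread leaves open whether it should be gated on engagement:

```python
def update_tp_exc(tp_exc, active, prev_active, newly_engaged,
                  max_exc=50.0, decay=5.0):
    """tp_exc: dict of cell id -> temporal-pooling excitation."""
    if newly_engaged:
        tp_exc = {}  # a new engagement clears any existing TP
    # Decay existing amounts; drop cells that reach zero.
    tp_exc = {c: e - decay for c, e in tp_exc.items() if e - decay > 0}
    # Newly active cells (or ALL active cells, when newly engaged)
    # get the full TP amount.
    new_cells = set(active) if newly_engaged else set(active) - set(prev_active)
    for c in new_cells:
        tp_exc[c] = max_exc
    return tp_exc
```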

@mrcslws (Collaborator):

> If the layer is not engaged, no proximal input comes through

Ok, yeah, because of the (select-keys (keys ff-good-paths)), and because of inhibition from TP excitation. Given the current numbers, the inhibition will have a much larger effect than this threshold, since the threshold for "good" is just 12 (:ff-seg-new-synapse-count) while the TP excitation starts at 50.

I'm just narrating in case it uncovers a flaw in my understanding.

So really the :ff-stimulus-threshold is just for engaged layers. Non-engaged layers use the higher "good" threshold.

@mrcslws (Collaborator) commented Oct 6, 2015

👍

floybix added a commit that referenced this pull request Oct 6, 2015
Rethink temporal pooling - discrete transitions at higher levels.
@floybix floybix merged commit d2435e8 into htm-community:master Oct 6, 2015
@floybix (Member Author) commented Oct 6, 2015

Just merging to carry on with experiments, not because this is finished by any stretch of the imagination.

@floybix floybix deleted the rethink-temporal-pooling branch October 6, 2015 05:36