Suggestion to try "substitution" or "context" pooling #33
Comments
@robjfr Did hell just freeze over? :) Rob Freeman on Github!? Welcome!!! |
@robjfr Can I offer my go-to response? Are you suggesting we insert an overt managing mechanism with no biological plausibility deep within an HTM? :-P Also, I see and understand what you're driving at by pooling based on sequence affinity - but I wonder how or if we'll find such a mechanism in the neuroscience? |
@cogmission Hi David. On Github? My projects all precede Github :-) Re. "an overt managing mechanism with no biological plausibility deep within an HTM?" FWIW I think this all fits perfectly with any biological evidence. At this point the implementation is just spreading activation in a network. Making it natural for a network is the whole issue. My vector implementation works. We now want to figure out how it might be implemented (better) in a network. |
I've been looking at Felix's experiment with spreading activation to count paths (commit above). It gets some nice results, but there's a problem: spreading activation loops, and gives high scores to repeated sequences. Thinking about it, you could say this is because by retreating to column states we've thrown away all context. A repeated state is the same state.

Maybe I was wrong to persuade Felix to abandon labelling on cell states, go to column states, and map context only by spreading activation along paths. I wanted to move away from cell states because they distinguish paths on context, and I wanted to cluster paths on context. But I forgot that for distributed states you can have both. I might have distinct cell states and still cluster on them to get the measure of counted paths that I want. The difference from Felix's labelled cell states would be that I would be labelling (a pooling boundary) on the variety/number of different cells/paths, not on state identity from an exact match of cells.

So I'm now thinking that the way to identify pooling boundaries based on context independence might not be to spread activation along paths, but to measure the distribution of cell states for a given column state. Where a given state has a variety of cell states, that should indicate a pooling boundary (on the principle of context independence). A question is: how much context is coded into current cell states? How far down the path does it distinguish?

Clustering context using the distribution of cell states rather than spreading activation might even still allow us to generalize along the lines of shared contexts in my diagram in this post to NuPIC-theory: http://lists.numenta.org/pipermail/nupic-theory_lists.numenta.org/2015-September/003216.html
There I try to capture generalization of "seq" with 34278 and 87523 based on common contexts 1 ... 2, and 3 ... 4. If e.g. prediction by "8" affects "4"'s cells, and that affects/clusters the predictive cells of "seq". Actually that generalization clustering looks very much like the cell based SDR clustering Felix was proposing in the first place. Though perhaps it needs the addition of a mechanism to map cell state to context independently of columns, so states can be identified based on contexts they share, independently of their column identity. Note the contrast between clustering contexts to generalize, and dispersing contexts to identify. An independent element should occur in a variety of contexts, but generalize on similarity of contexts. If all of the above is true, then we should forget spreading activation, and start thinking about ways to identify based on the dispersal of cell states against concentration of column state, and generalize based on the concentration of cell states against dispersal of column state. Actually, that might be a nice principle: identity based on dispersal of cell states against concentration of column state, and generalization based on concentration of cell states against dispersal of column state. Note state will be a matter of degree. We shouldn't have labels (beyond perhaps input labels.) But we should be able to highlight different (degrees of) groupings using, say, sliders representing the above (dispersal of cell states against concentration of column state, and concentration of cell states against dispersal of column state) parameters. Those sliders might indicate identity (bracketings/hierarchy) along a sequence, and generalization (substitutions) across a sequence. |
Just considering identity, not generalisation, to begin with.
I'm not clear how we could use multi-contextual columns to define pooling boundaries. I asked this on the mailing list before: http://lists.numenta.org/pipermail/nupic-theory_lists.numenta.org/2015-September/003194.html In your diagram example you propose that the sequence
The start of this pooling could be signalled by encountering "s" which occurs in multiple contexts (i.e. there are multiple cells in its columns which have distal dendrite segments). If we also have the input
then logically we could have a signal for the end of pooling by noting that both "2" and "4" also occur in multiple contexts. However we may well have seen the other states within the pooled element in other contexts too:
Which seems to break the pooled sequence. Your response when I said this before was: "look at entire sequences of input states, and see how the contexts of those entire sequences vary." It seems to me like a kind of attention mechanism maybe, where we learn a template sequence... once we recognise a template we know which slots can vary, and we have learned to pay attention to those slots... this is very vague I know. I feel like I should be able to understand what you mean by your principle, but, well, I don't really. What would your proposed sliders do (or what is the information input & output)? |
@floybix. I probably should sleep on this. But here's an unpolished version anyway. According to the "release early, release often" principle.
I think the same answer still applies. We still want to use the same parameters. I just think there is a better way to access them. My answer to your question above was that we can't base a segmentation decision on a single state, we need to consider paths, "look at entire sequences of input states, and see how the contexts of those entire sequences vary", as you say. The problem was that if we only consider paths, we can never group paths according to context. This is because each path is made unique by its context = no grouping. This was the original problem in your example sentences with "Chifung" as I recall. Every mention followed a different sentence, giving "Chifung" a new state each time. To solve this I suggested going to the other extreme, labelling only on columns, and counting paths with a separate activation spreading mechanism. That way we could have varying paths, and at the same time group them, using column states. My mistake was thinking that to group using column states, I needed to throw out the cell information. In fact, the great advantage of distributed representation is that it can contain both. If you like we can have two variables giving four extreme possibilities:
An attention mechanism yes. Not fixed slots though. The existence of "slots" will depend upon what we are attending to. If we keep both column and cell sensitivity at the maximum, we will just identify unique strings, like your cell labelled transition diagram now. If we lower the sensitivity of both just a bit, though, then we should get little clouds of paths, and associated with them little hierarchies of bracketing column states/paths, with nested cell dispersal scores. It is now the effect of lowering sensitivity of both a bit, but keeping both, which I think will give us the result we want.
I think it would do in part what your original transition diagram did. Except in that case you had both cells and columns tightly specified. You introduced a check box which selects between possibilities 1) and 2), so you've come part of the way. What would happen if you allowed columns to vary while keeping the cell states (context mappings anyway) constant, my possibility 3), or alternatively if you allowed both to vary only partially, my possibility 4)?

The output for 1) would be like your current cell state labelled transition diagram, only with the option to merge distinct states like those you got for "Chifung" by lowering cell sensitivity, and to milk the new information we get for the cell state dispersal of "Chifung".

The output for 2) would be as now for the column labelled transition diagrams, only we would be able to knock out the problem of paths looping abcabcabc (the opposite of the "Chifung" problem), and milk the same path dispersal information by directly accessing cell states.

We should recognize the underlying subjectivity by not labelling states at all (except maybe input states). Ideally groupings could be revealed using distance, like lots of little "tag clouds" around more densely clustered branch points of high dispersal values. Even the hierarchies of these dispersal values might be displayed graphically, as trees, more or less flat according to the balance between alternate sets of potential pairings.

I guess the short answer to your question about states out of context giving inappropriate branching scores is that I'm still suggesting we distinguish between them by using the path, but I'm hoping that the preceding and following paths will select a dispersal of cell states for a given context that will eliminate the problem (e.g. "set" in "Mary has a full set" versus "Mary set the ball rolling"). I don't know if cell states code context far enough back to be effective at this now. But if not, I think we could modify the cell coding so it did. |
Trying to get my head around your distance metrics (or dispersal / branching values?). Do you mean, taking each cell activation instance to have these properties:
Using these, perform a clustering of timesteps with a distance/proximity metric that is between two extremes:
Then
And as you said, expecting that somehow a combination of these will be most useful. |
Yes. That sounds right.
I don't see us identifying these clusters as such in the output at all at this point. Not in the sense of identifying them with a label, anyway. Rather I see us presenting clusters (of input states?) directly, as a kind of "tag cloud".

Here's what I'm thinking the output might look like. We could plot clouds across the sequence dimension. At the moment you plot everything in sequence, except where you merge labels and loop the sequence back on itself. Well, I'm suggesting that instead of merging labels, we just stack labels beside each other when we cluster them, with the distance transverse to the sequence dimension varying by closeness of clustering.

With sensitivity for both column and cell clustering reduced to a minimum, all states would be stacked beside each other in one big cloud, with sequence information and column identity completely lost. With sensitivity for both at a maximum, tags would be spread out along the sequence in the order they were presented to the system.

But here's where it gets interesting. Starting with minimum sensitivity for both and gradually increasing context sensitivity would gradually spread the cloud along the sequence, as highly context specific states were forced out of the cloud and into their specific sequence. This would produce the effect of looser clouds anchored by more context specific clouds. Pairs of such anchors could be selected based on their density to produce a best hierarchy of pairs of such anchor points. |
Actually, rather than plotting both context and identity as clouds across the sequence, and having a hard threshold to split them out of the cloud into sequence again, it would probably make more sense to plot column sensitivity across the sequence, and cell sensitivity along the sequence. Duh :-) That way more context sensitive states would naturally spread along the sequence as you increased context sensitivity. I think that would work. |
I couldn't see how that "tag cloud" with sensitivity sliders could work. Maybe you meant a force-directed graph layout or projection modified to include sequence information. And what exactly are the sensitivity parameters in relation to a distance metric? Anyway, I added a table below the "Cell-SDRs plot" which records all the inputs in sequence and, for each time step, lists a version of your two metrics:
Does this help to pick out pooling boundaries? |
I ran the simple_sentences demo with input: a b c a b c a b c x y z a b c a b c a b c Your values for your new table were as follows:
Which looks about right (though I don't see why "x" is getting two inputs, and it seems a little odd that the first transition a->b doesn't get marked). Anyway, "a" and "c" get the greatest diversity of contexts (I guess you count contexts forward and back, though I don't see why "c" gets 0.45 and "a" gets 0.87). That's diversity of contexts. So "a" and "c" are least sensitive to context. That is because "a" is preceded mostly by "c", but once by "z", and "c" precedes mostly "a", but "x" once.

So to translate this to ordering information in a putative plot: states "b", "x", "y", "z" are all very context specific. So if we have any context sensitivity in the plot at all, they are going to demand their context. That will push the sequence out. "b" will demand to be between "a" and "c". "x-y-z" will demand to occur between "c" and "a". If you like, we'll have a jigsaw puzzle of sequence information for the plot, with pieces "a-b-c" and "c-x-y-z-a", but where "a" and "c" will accept both "c" and "z", or "a" and "x", respectively. If you plot out all possible orderings (cloud?) with those constraints, I think you'll get:
If you keep on increasing the context sensitivity, then at maximum sensitivity the different occurrences of "a" and "b" should distinguish themselves too, and assert that "x" only follows "c" after "a b c a b c a b", and "a" is only preceded by "z" before "b c a b c a b c". And then your plot will reduce again to: a b c a b c a b c x y z a b c a b c a b c Equally at minimum context sensitivity you'll just get a cloud without any sequence. The only parameter will be column similarity. |
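A minimal sketch of how context diversity for the "a b c ... x y z ..." input above could be tallied. It only counts symbol-level neighbours, so it is an illustration of the idea and not the cell-based metric in the table; the function names and the use of Shannon entropy as the "diversity" score are assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log2

def context_counts(sequence):
    """Count, for each state, which states precede it and which follow it."""
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for prev, cur in zip(sequence, sequence[1:]):
        before[cur][prev] += 1
        after[prev][cur] += 1
    return before, after

def diversity(counter):
    """Shannon entropy in bits of a context distribution (assumed score).

    Zero means the state was only ever seen in a single context.
    """
    total = sum(counter.values())
    return sum(-(n / total) * log2(n / total) for n in counter.values())

seq = "a b c a b c a b c x y z a b c a b c a b c".split()
before, after = context_counts(seq)
for state in sorted(set(seq)):
    print(state,
          "preceded by", dict(before[state]),
          "followed by", dict(after[state]),
          "diversity", round(diversity(before[state]), 2),
          round(diversity(after[state]), 2))
```

On this input only "a" (preceded by "c" or "z") and "c" (followed by "a" or "x") have more than one context on either side, which matches the reading above that they are the least context specific states.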
The "x" row listed 2 inputs occurring in its context; its context being
I added an enumeration of the contexts and alternatives for the selected
I see, it sounds like you want a kind of force-directed graph layout. To be
On Wednesday, 7 October 2015, Rob Freeman notifications@github.com wrote:
Felix Andrews / 安福立 |
You're right about the dimensions. Pushing everything to 2D is artificial. Even column clusterings will have many many dimensions. But I think the combination of columns and context should allow us to reduce the dimensions to a manageable level for each particular case. After all, at the limit we can always reduce our data to the single dimension of a single sequence, in the order it was presented. So if you like this is a dimension reduction mechanism. But graduated, so we can find more dimensions at will. And viewed from the other direction, this is a dimension generation mechanism. We are trying to find structure in the mess of potential relationships in the world. Pooling, if you like, is a problem of finding two dimensions (or more) in one dimensional data. Why 2D? Well, 3D might be better. How many dimensions do we resolve the world into when we think? But the bottom line is that by using a tag cloud we are admitting there is no "natural" dimension. We simplify the world to suit our purposes, but our perception is always subjective. This is a mechanism for finding meaningful structure (based on the independence of something from its environment as a theory of "identity". That's compared to repetition as a "theory of identity.") What would it give us? Pooling. For a given sensitivity to cells and columns it should segment a sequence into sections which are more independent of their environment (= greater identity) and segments which are not. |
I think you are misunderstanding how context is represented in HTM. It does not represent history arbitrarily far back. Whenever we hit an unpredicted state the columns burst, opening up all predictions. If the following state is one of those predictions, then we are in that context. So in " You can see this on the Cell SDRs diagram.
Could you explain how to select/filter a set of contexts within a larger context, if that is the right way to put it?
Well, no, it would give us a diagram. I was questioning whether such a diagram is the best way to go about deriving the information we need. I'm assuming what we are primarily aiming for is "pooling boundaries" within which to form stable and distinct representations at higher levels. |
There used to be context sensitivity which extended further back in a sequence. That is what is now called "old" TP! It may have been conceived differently, as a way of getting a more stable representation for longer sequences, but I certainly don't think anyone is going to claim long distance context dependencies are biologically implausible. I don't think context extending back exactly one discrete state is a tenet of even current HTM. We may not have that implemented, but I don't see any reason it shouldn't be.
You're probably right. At this point HTM (the CLA actually) won't be capturing the information we need. In my example it will perhaps have enough time to learn the "a b c" transitions, but I don't know if it will have learned any cell states for "x y z" on just one presentation. Or perhaps it does start to select a context cell on just one presentation?? Either way I think that is detail. If it takes more than one presentation, then it takes more than one. In that case the pooling effect will only start to appear after a few presentations (but since the pooling decision will be based on a cloud of historical strings, even new strings will be pooled..) As for a sensitivity parameter. Well, I don't know if predictive cells have connection strengths in the current implementation. Isn't that what happens with learning? If they do then a sensitivity threshold could be used to turn them on or off. The utility of doing that would only become really apparent once a state had learned many contexts. Then you could turn off just the weakest, keep only the strongest, or anything in between and adjust where it fit (in your "forced graph"?) on that basis.
Just what I wrote above. Maybe a state has only occurred in a given long sequence once (or 10 times) so the predictive cell connection is not strong. But it might have occurred in a shorter sub-sequence more often, and thus have stronger predictive cell connections for those sub-sequences. So maybe a lower sensitivity would select its ordering with respect to the shorter sub-sequence, but not force it to line up according to the entire longer context. Yes, you'd need to code longer distance dependencies for this (like "old" TP?)
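A rough sketch of the thresholding idea, assuming each learned context of a state carries a single strength value between 0 and 1. The strengths, the tuple representation of a context, and the name `select_contexts` are all hypothetical; nothing here reflects how the CLA actually stores segment permanences.

```python
def select_contexts(strengths, threshold):
    """Keep only the contexts whose (assumed) strength reaches the threshold."""
    return {ctx: s for ctx, s in strengths.items() if s >= threshold}

# Hypothetical strengths for the state "set": the short sub-sequence has been
# seen often, the full long sequence only rarely.
strengths = {
    ("full",): 0.8,
    ("Mary", "has", "a", "full"): 0.2,
}

strong_only = select_contexts(strengths, threshold=0.5)
everything = select_contexts(strengths, threshold=0.1)
print(strong_only)
print(everything)
```

With the higher threshold only the frequent short context constrains where the state sits; lowering the threshold lets the rarer, longer context constrain it as well.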
This does that. The grouping {"abc", "xyz"} is a stable and distinct representation. It is stable in the sense that evidence for it grows cumulatively as more shared contexts are encountered. But, and this is new, it does not exclude new sequences either. It maps new things to combinations of old things, providing a bridge between new and old. That is a very powerful thing: a way to find structure in novelty. But I suppose this bit really does go beyond implementation and require an insight of theory. The important question at the end of the day is what kinds of groupings do you want your pooling system to find? Do you only want repeated sequences? I strongly believe that repetition should not be the only criterion for identity in a cognitive system. Instead I believe interchangeability in context is a powerful parameter. And here, while I don't like to depend on language, I believe language does provide us with the best evidence so far. A definition of identity based on repeated sequences applied to language gives us words, but only words. It cannot capture any structure above the level of a word. In fact repeated language structure is the only really good objective definition of a word. So a pooling system based on repeated sequences will enable you to pool sequences of letters into words, but it will not let you find any structure above words. If you have had difficulty finding structure in language above the level of words, that is the reason. This definition of identity based on interchangeability in context will solve that. Though I don't think its application will be limited to language structure. My guess is that it will also turn out to apply to finding structure in novel visual scenes/perspectives, and any time you need to consider novelty, or even ambiguity, in cognition in general. 
But yes, to see the utility of what I'm suggesting, you will need to accept the hypothesis that a meaningful cognitive definition of identity emerges from the parameter of independence from context, and not only repetition. |
Sure, I agree with your theory of meaning based on substitutions, and you've explained it well. Gradually absorbing the implications... this might take me a while. Thanks for your patience. |
@floybix Perhaps this should be a separate "issue", but how are sequences of states represented now (in pooling?) It strikes me that a sequence of states could be represented as below. Cell states predict sequence, but they should also be able to distinguish sequence. You might represent a whole sequence of column states as a single merged column state, with sequences between original sub-states of the merged whole distinguished (grouped?) by their cell states (indicating a sequence among sub-sets of columns.) In extreme simplicity, if the current CLA codes a sequence of states A B C conceptually as (each column representing an entire SDR state):
Then activating all the columns of A, B, and C, at once would produce a single merged or pooled state, which could nevertheless be decoded back to the original sequence by finding distinct sub-groupings of cells (b c a) among the columns. I don't know, perhaps this is how entire sequences are represented now in pooling. I got the impression pooled sequences were assigned novel column representations (though I'm not sure how that novel column representation can be decoded to reproduce the original sequence.) |
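A toy sketch of the merge-and-decode idea, under strong simplifying assumptions: each state stands in for a whole column pattern, and the "cell choice" within those columns is modelled simply as the label of the preceding state. The functions `pool` and `decode` are hypothetical names for this illustration only.

```python
def pool(sequence):
    """Merge a sequence into one unordered set of (state, context) pairs.

    `state` stands for a column pattern; `context` stands for the cells
    chosen within it, modelled here as the preceding state (None at start).
    """
    return {(cur, prev) for prev, cur in zip([None] + list(sequence), sequence)}

def decode(pooled):
    """Recover an ordering by chaining each context to the state it selects."""
    by_context = {ctx: state for state, ctx in pooled}
    out, ctx = [], None
    # The length guard stops the walk if the chain loops back on itself.
    while ctx in by_context and len(out) < len(pooled):
        ctx = by_context[ctx]
        out.append(ctx)
    return out

merged = pool(["A", "B", "C"])
print(decode(merged))
```

Because context here reaches back only one step, this only recovers sequences in which no state repeats; a sequence such as "a b c a b c" would need cells that encode longer context.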
That's just what I'm working on now: how to represent the full context in a sequence, defining what I call "context space". But rather than doing a union within the same layer, I am looking at an accumulation of states in a higher level layer, which supplies context via feedback connections. A higher-level / pooled representation can be "decoded" in the sense of entraining the lower layer into replaying the original sequence, given an initial stimulus. It does this because the lower sequence cells are (simultaneously) predicted by the stable higher ones, which enables the lower layer to disambiguate the multiple predictions at any point in the sequence. |
I'm still catching up, at my own pace and in my own style (currently rereading On Intelligence). Your "replaying the original sequence" brought this passage to mind:
You can imagine a lower layer recounting details in a story while the higher layer navigates the high-level plot events. And sometimes the lower layer stumbles onto something that suddenly hijacks the higher layer (a tangent). |
@floybix I'm glad we're on the same page with this. I was getting tangled thinking how we might represent strings and perform the "jigsaw" step I pictured earlier. Your "decoding" sounds about right to me too. Though I see decoding as just another type of clustering across this cumulative pooled representation: a clustering of columns as they have cells in common. I guess that is the same as what you are saying. Just you are highlighting the use of an input state to initiate the clustering and begin the "decoding"?? But a pooled representation of this form seems very powerful to me. Every operation we need should be conceivable as a clustering over one or other aspect of it. To "decode" the sequence, cluster columns on cells in common, as above. But if we reduce the sensitivity of this cell clustering, the resolution of the mapping from contexts to columns would be reduced, and decoding would produce not just the next column state in the original sequence, but a new sequence of little clouds of merged column states representing a sequence of a number of individual states which tend to share contexts. This would be the "jigsaw" step. Conceptually it is a reshuffling (pooling) of states into new ordered substitution classes. The ordering of these substitution classes should give a substitution based segmentation of the original sequence, producing "meaning" for each segment, where "meaning" is in the sense of a (substitution based) organization of examples, hierarchy (and improved/smoothed predictions to boot, as something practical and testable.)
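One way the "jigsaw" step might be sketched at the symbol level is to group states by the contexts they share. This assumes a context is just the immediately preceding and following state, which is far cruder than clustering on cells; `substitution_classes` and the example sentences are made up for illustration.

```python
from collections import defaultdict

def substitution_classes(sequences):
    """Group states that occur between the same (previous, next) neighbours."""
    by_context = defaultdict(set)
    for seq in sequences:
        padded = [None] + list(seq) + [None]
        for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
            by_context[(prev, nxt)].add(cur)
    # Only contexts shared by more than one state define a class.
    return {ctx: states for ctx, states in by_context.items() if len(states) > 1}

classes = substitution_classes([
    "the cat sat".split(),
    "the dog sat".split(),
])
print(classes)
```

Here "cat" and "dog" fall into one class because both occur between "the" and "sat". Lowering sensitivity in the sense discussed above would correspond to accepting looser matches between contexts than the exact equality used in this sketch.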
I love that passage from Jeff. I know exactly what he means! But I think we can model this all within the pooled layer. It seems to me to be exactly a failure of the "merging" process which occurs when you lower context sensitivity the way I'm talking about above. Reducing context sensitivity before performing the sequence "decoding" will form "little clouds of merged column states representing a sequence of a number of individual states which tend to share contexts". That's like collecting all examples of a similar situation in your head, and using them all to generalize about the situation, make predictions etc. When I'm talking to someone and they're rambling, I'm trying to map what they are saying to some meaningful summary: "we were late so we couldn't do X", "the engine was wet so it wouldn't start". What I really want them to say is "the engine was wet so it wouldn't start". I don't want them to say "It was a dark and stormy night......." So I think the case Jeff is talking about is a failure of generalization of the type we're discussing. A failure to perform it, a failure to smudge your sensitivity to context, and collect contextually similar examples together in a generalization of their sequence relevant to them all, it is a failure to "think clearly". But we can get both the detail of a particular day, and the general situation, by lowering the sensitivity of all our historical sequences, and clustering, all within the same layer. Just one layer for input, and one for pooling, at this stage. |
Perhaps I should emphasize: just one layer, but with some kind of "amplifier" applied to it (to reduce the sensitivity to contexts.) And that I'm suggesting this can give us a model for "thinking"(! as opposed to just remembering, or even strict pattern recognition, here we're creating new patterns): merging column representations on reduced sensitivity to context. |
By the way Felix. If you don't want to limit yourself to abstract sequences of symbols when testing this, it might be interesting to try data in the form of successive visual saccade fixations. The merging of column representations with reduction in context sensitivity for sequences of these images as the "eye" travels back and forth over the visual field, might have direct interpretation as actual perceived cognitive objects, a cognitive segmentation of the visual field. I don't know if that interpretation is going to apply. I'd have to think through whether columns could directly represent pixels, or if the retina perhaps processes them significantly first, so the column representation should not be a naive pixel map. Also, I'm not sure where the famous visual primitive neurons would fit in (simply the result of the retinal processing?) And we'd have to think what it meant when the path passed back on itself inexactly... Perhaps not quite the same thing as repeating a symbol. We would get partial matches. But an actual merging of images in substitution pooled representations, producing visually significant cognitive objects and a (hierarchical) partition of the visual field, might be something to think about. |
I've been nagging Felix about trying what I call substitution/context pooling. This is based on the intuition I've posted about in the NuPIC-theory thread below and elsewhere:
http://lists.numenta.org/pipermail/nupic-theory_lists.numenta.org/2015-September/003191.html
"Intuitively, something which has identity, a pooled state, should be something which is independent of its environment. Which is the same thing as saying it will occur in a variety of contexts."
A naive first implementation of this might be pooling based on the number of alternate historical paths between two states in a sequence.
To do this we need a way to count historical paths between two points in a sequence. As a first suggestion we might allow all presented states in a sequence to pass activation to neighbouring states in historical sequences, and then count activations at different points in the presented sequence after some number of activation iterations. Then we might order (substitution) poolings based on the activations at each state in the sequence.
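A minimal sketch of counting historical paths between two states, assuming the history is reduced to a first-order transition graph over state labels. This is not the ComportexViz implementation; `build_transitions` and `count_paths` are hypothetical names, and the step-by-step propagation of counts is one simple reading of "passing activation to neighbouring states".

```python
from collections import defaultdict

def build_transitions(history):
    """Map each state to the set of states that have followed it in history."""
    succ = defaultdict(set)
    for prev, nxt in zip(history, history[1:]):
        succ[prev].add(nxt)
    return succ

def count_paths(succ, start, end, max_len):
    """Count walks from start to end that use at most max_len transitions.

    Activation (a path count) is pushed one transition per iteration.
    Walks stop when they first reach `end`, but may revisit other states.
    """
    counts = {start: 1}
    total = 0
    for _ in range(max_len):
        nxt_counts = defaultdict(int)
        for state, n in counts.items():
            for nxt in succ.get(state, ()):
                nxt_counts[nxt] += n
        total += nxt_counts.pop(end, 0)
        counts = nxt_counts
    return total

history = "s 1 2 e s 3 4 e s 1 4 e".split()
succ = build_transitions(history)
print(count_paths(succ, "s", "e", max_len=3))
```

A higher count between two states suggests more alternative routes between them, which is the quantity the pooling order would be based on. Note that with a larger `max_len`, walks can loop through repeated states and inflate the count.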
Felix has already implemented some of this with a ComportexViz commit:
htm-community/sanity@7e88314
This "issue" is by way of drawing attention to this experiment, and inviting wider comment.