[feature] decouple visualisation UI's topic numbering with their label #267

ed9w2in6 · 2024-04-24T12:38:48Z

We have a whole family of issues that are just about the numbering of topics during visualisation:

- numbering confusions
- rename topic feature requests
  - Renaming topics #92
- bug due to incorrect indexing
  - major LDAviz UI topic index bug (e.g. term, topic, etc.) #265 (fix at "fix: #265 and #261 (/js/ldavis.v3.0.0.js)" #266)

They can all be resolved just by decoupling the numbering from the labels, which also removes the need for the sort_topics and start_index options in the Python API.

I am not going into implementation details or a specification of outcomes here, but here are some ideas:

Outline

`python` API side

We currently generate topic numbers in topic_top_term_df in _prepare.py. The numbering comes from enumerate together with start_index, which the user supplies to the prepare method and which is passed down through the _topic_info method.

pyLDAvis/pyLDAvis/_prepare.py

Line 276 in 16800f3

    
           topic_dfs = map(topic_top_term_df, enumerate(top_terms.T.iterrows(), start_index))

Sorting is orthogonal to this logic, so we can safely ignore it when changing this code:

pyLDAvis/pyLDAvis/_prepare.py

Lines 413 to 416 in 16800f3

    
    if (sort_topics):
        topic_proportion = (topic_freq / topic_freq.sum()).sort_values(ascending=False)
    else:
        topic_proportion = (topic_freq / topic_freq.sum())

The number generated by enumerate is ultimately used to name the topic, and that name is stored as Category:

pyLDAvis/pyLDAvis/_prepare.py

Line 265 in 16800f3

'Category': 'Topic%d' % new_topic_id,
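To make the coupling concrete, here is a minimal sketch of how the label falls out of the counter. The helper `topic_categories` is hypothetical and only imitates the quoted lines; it is not a pyLDAvis function.

```python
def topic_categories(n_topics, start_index=1):
    """Imitate how 'Category' strings are derived from the enumerate counter.

    Hypothetical helper for illustration; pyLDAvis builds these strings
    inside topic_top_term_df rather than in a standalone function.
    """
    return ['Topic%d' % new_topic_id
            for new_topic_id, _ in enumerate(range(n_topics), start_index)]


print(topic_categories(3))                 # ['Topic1', 'Topic2', 'Topic3']
print(topic_categories(3, start_index=0))  # ['Topic0', 'Topic1', 'Topic2']
```

Because the displayed name is just the counter formatted into a string, any change to start_index or to the ordering changes every label at once.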

I believe we should allow the user to supply a list of strings instead.

If we change this, we also need to change sorted_terms, which rebuilds the same name:

pyLDAvis/pyLDAvis/_prepare.py

Lines 443 to 449 in 16800f3

    
    class PreparedData(namedtuple('PreparedData', ['topic_coordinates', 'topic_info', 'token_table',
                                                   'R', 'lambda_step', 'plot_opts', 'topic_order'])):
        def sorted_terms(self, topic=1, _lambda=1):
            """Returns a dataframe using _lambda to calculate term relevance of a given topic."""
            tdf = pd.DataFrame(self.topic_info[self.topic_info.Category == 'Topic' + str(topic)])
            if _lambda < 0 or _lambda > 1:

We would also have to make sure that none of the supplied names is "Default", since that is already used as the category of the default term info:

pyLDAvis/pyLDAvis/_prepare.py

Lines 237 to 242 in 16800f3

    
    default_term_info = pd.DataFrame({
        'saliency': saliency,
        'Term': vocab,
        'Freq': term_frequency,
        'Total': term_frequency,
        'Category': 'Default'})
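If user-supplied names were written into Category directly, a check along these lines would be needed. This is only a sketch: `validate_topic_names` and `RESERVED_CATEGORY` are hypothetical names, and the uniqueness rule is my assumption, based on Category being used to filter the rows of a single topic.

```python
RESERVED_CATEGORY = 'Default'  # category already used by default_term_info


def validate_topic_names(topic_names, n_topics):
    """Hypothetical check for user-supplied topic names (not part of pyLDAvis)."""
    names = [str(name) for name in topic_names]
    if len(names) != n_topics:
        raise ValueError('expected %d topic names, got %d' % (n_topics, len(names)))
    if RESERVED_CATEGORY in names:
        raise ValueError('%r is reserved for the default term info' % RESERVED_CATEGORY)
    # Assumption: rows are selected by comparing Category to a single name,
    # so two topics sharing a name would be merged in the bar chart.
    if len(set(names)) != len(names):
        raise ValueError('topic names must be unique')
    return names


print(validate_topic_names(['sports', 'politics'], 2))  # ['sports', 'politics']
```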

And that covers the topic_info data only; we would have to do the same for mdsData and token_table too.
Clearly a better way is to side-step all of this: just supply the desired list of names and store it in the PreparedData namedtuple.

Solution: side step at JS visualisation side

Currently, our visualisation logic makes the hard assumption that Category must be of the form "TopicN", where N is a number:

pyLDAvis/pyLDAvis/js/ldavis.js

Lines 697 to 701 in 16800f3

    
    function reorder_bars(increase) {
        // grab the bar-chart data for this topic only:
        var dat2 = lamData.filter(function(d) {
            return d.Category == "Topic" + vis_state.topic;
        });

Therefore, again, the path of least friction is to side-step it and change only the visualisation logic:

RHS Table title

pyLDAvis/pyLDAvis/js/ldavis.js

Lines 982 to 987 in 16800f3

    
    .attr("y", -30)
    .attr("class", "bubble-tool") //  set class so we can remove it when highlight_off is called
    .style("text-anchor", "middle")
    .style("font-size", "16px")
    .text("Top-" + R + " Most Relevant Terms for Topic " + topics + " (" + Freq + "% of tokens)");

circle label

pyLDAvis/pyLDAvis/js/ldavis.js

Lines 388 to 393 in 16800f3

    
    .style("font-size", "11px")
    .style("fontWeight", 100)
    .text(function(d) {
        return d.topics;
    });

Of these, the circle label change is optional. So only 3 changes in total!

Summary, changes needed

- new parameter for topic names
- store it in PreparedData
- change the RHS table title, and optionally the circle labels too


msusol · 2024-04-24T15:12:17Z

Are you creating a matching pull request?

ed9w2in6 · 2024-04-25T04:40:51Z

@msusol Yes, still WIP though. Ideally we would clean up the code base, but I do not have such plans.
My plan is just the quick hack mentioned above:

- add a new param to prepare, defaulting to None, with some logic to generate dummy topic names if it is None
- store it in PreparedData
- change the visualisation accordingly:
  - RHS table title
  - the circle labels too, if it looks good
  - allow selecting a topic by topic name too, if not too difficult
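A rough sketch of the Python half of that plan, under stated assumptions: the parameter name `topic_labels`, the helper `resolve_topic_labels`, and the `LabelledData` tuple are all hypothetical, and `table_title` is only a Python stand-in for the string that ldavis.js builds for the RHS table title.

```python
from collections import namedtuple

# Hypothetical stand-in for PreparedData with one extra field for the labels.
LabelledData = namedtuple('LabelledData', ['topic_order', 'topic_labels'])


def resolve_topic_labels(n_topics, topic_labels=None, start_index=1):
    """Return display labels, falling back to dummy numeric names if None."""
    if topic_labels is None:
        # Dummy names reproduce the current numbering, e.g. '1', '2', '3'.
        return [str(i) for i in range(start_index, start_index + n_topics)]
    labels = [str(label) for label in topic_labels]
    if len(labels) != n_topics:
        raise ValueError('expected %d labels, got %d' % (n_topics, len(labels)))
    return labels


def table_title(R, label, freq):
    """Python imitation of the RHS table title text built in ldavis.js."""
    return 'Top-%d Most Relevant Terms for Topic %s (%s%% of tokens)' % (R, label, freq)


data = LabelledData(topic_order=[2, 1],
                    topic_labels=resolve_topic_labels(2, ['sports', 'politics']))
print(resolve_topic_labels(3))  # ['1', '2', '3']
print(table_title(30, data.topic_labels[0], 12.5))
```

With this shape the Category values stay untouched, so nothing that filters on "TopicN" needs to change; only the displayed text reads from the stored labels.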

msusol self-assigned this Apr 24, 2024
