suggestion: download row ID in clusters #6

Open

yuzuhikorunrun opened this issue Jul 21, 2017 · 6 comments
@yuzuhikorunrun

Hi there,

Thanks for this amazing, fancy version of Mapper! After working through a couple of datasets using km, I have a few suggestions for the next update that will hopefully be helpful to others as well:

  1. In the 3D output, when we hover over a node we only see the classification label (e.g., if the outcome is binary, we only see 0/1). What is not shown, but would be extremely helpful for later validation of the results with traditional statistical approaches, is the (number of) row IDs within each node. If there were a way to see how many rows (assuming your data has one ID per row) are in each cluster, it would be really informative.

  2. Building on that feature, it might be worth adding another function that lets us select a specific cluster (assuming there are several clusters in the output) and download the row IDs in it (see the sketch below). That way we could take the clusters generated by km and load them into logistic regression or other traditional approaches to find out what drives the separation of those clusters.
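
I imagine something along these lines, assuming the graph returned by mapper.map() is a plain dict whose "nodes" entry maps each node ID to the list of member row indices (the helper names below are made up, just for illustration):

import numpy as np
import pandas as pd

# Rough sketch only (not part of km): count rows per node and dump the
# member row IDs of one node, assuming graph["nodes"] maps node IDs to
# the row indices of their members.
def node_sizes(graph):
    return {node_id: len(members) for node_id, members in graph["nodes"].items()}

def save_node_row_ids(graph, node_id, path):
    # Row IDs of one node, written to CSV for later use in e.g. logistic regression.
    pd.DataFrame({"row_id": graph["nodes"][node_id]}).to_csv(path, index=False)

# Showing the row ID in the hover tooltip itself might be as simple as
# passing the row indices as custom tooltips:
# mapper.visualize(graph, custom_tooltips=np.arange(len(data)), ...)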

Thanks again and please let me know if you need extra clarification on this.

-Yuzu

@yuzuhikorunrun yuzuhikorunrun changed the title suggestion: # of row IDs, download data suggestion: # of row IDs, download row ID in clusters Jul 21, 2017
@yuzuhikorunrun yuzuhikorunrun changed the title suggestion: # of row IDs, download row ID in clusters suggestion: download row ID in clusters Jul 21, 2017
@MLWave
Member

MLWave commented Jul 25, 2017

As for 2., I can do two things. Easy:

  • create an extra function, .data_from_cluster_id(id), where id is an int (or maybe a list of ints) with the cluster ID you gathered from the tooltip. It returns the .csv data (or maybe a DataFrame, which you can store with to_csv).

Harder:

  • create a dynamic KeplerMapper application that runs in the browser, then add the ability to download a .csv from the tooltips, or create lasso tools to select nodes/subsets of the network.

As for 1., I'll implement this soon.

As for "what drives the seperation of such clusters", I am coding up a decision tree based method to find decision rules for "random sample" negative class and "member of cluster" positive class.

I have also already looked at providing statistics on a cluster compared to the entire dataset, i.e., for every cluster, stats like: age = 3 STD over the dataset mean: http://mlwave.github.io/tda/bake.html
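
And a sketch of those per-cluster statistics (how many standard deviations a cluster's feature mean sits above or below the dataset mean); again only illustrative:

import numpy as np

def cluster_vs_dataset_stats(data, graph):
    # For every cluster: (cluster mean - dataset mean) / dataset std, per feature,
    # so a value of 3.0 reads as "3 STD over the dataset mean".
    mu = data.mean(axis=0)
    sd = data.std(axis=0) + 1e-12          # guard against zero-variance features
    return {node_id: (data[members].mean(axis=0) - mu) / sd
            for node_id, members in graph["nodes"].items()}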

Any feedback on this?

@MLWave
Member

MLWave commented Jul 25, 2017

import km

# Load digits data
from sklearn import datasets
data, labels = datasets.load_digits().data, datasets.load_digits().target

# Initialize
mapper = km.KeplerMapper(verbose=2)

# Fit and transform data
projected_data = mapper.fit_transform(data,
                                      projection=km.manifold.TSNE(random_state=1))

# Create the graph (we cluster on the projected data and suffer projection loss)
graph = mapper.map(projected_data, 
                   clusterer=km.cluster.DBSCAN(eps=0.3, min_samples=15),
                   nr_cubes=35,
                   overlap_perc=0.9)

# Create the visualizations (increased the graph_gravity for a tighter graph-look.)
mapper.visualize(graph, 
                 path_html="keplermapper_digits_ylabel_tooltips.html",
                 graph_gravity=0.25,
                 custom_tooltips=labels)

# Collect cluster data
X_cluster = mapper.data_from_cluster_id(430, graph, data)
y_cluster = mapper.data_from_cluster_id(430, graph, labels)

print(X_cluster)
print(X_cluster.shape)
print(y_cluster)
print(y_cluster.shape)
Output:

[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  0.  7. ...,  0.  0.  0.]
 [ 0.  0.  1. ...,  8.  0.  0.]
 ...,
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  2.  0.  0.]]
(24, 64)
[1 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
(24,)

@sauln
Member

sauln commented Apr 13, 2018

We might be able to quickly build a JavaScript function that could do most of this from the visualization.

The HTML already has all of the graph metadata, which includes the index information. I could see a right-click on a node offering options to save the data or copy it to the clipboard.

Otherwise, this kind of exploration loop would be best done inside a notebook, where the mapper is persistent.
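
Something like this is what I mean by the notebook loop (assuming the mapper, graph, and data objects from the example above are still in scope; the node ID below is just a placeholder):

import pandas as pd

# With mapper/graph/data still alive in the notebook, nodes can be
# inspected directly instead of going through the HTML.
for node_id, members in sorted(graph["nodes"].items(),
                               key=lambda kv: len(kv[1]), reverse=True)[:5]:
    print(node_id, "->", len(members), "rows")

# Pull the raw rows of a single node and export them.
rows = graph["nodes"]["cube0_cluster0"]   # placeholder node ID
pd.DataFrame(data[rows]).assign(row_id=rows).to_csv("node_rows.csv", index=False)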

@ghost

ghost commented May 4, 2018

@MLWave @sauln 👍 Perhaps selecting multiple nodes using a lasso tool and then exporting them. This feature would help in studying and understanding clusters/groups of nodes with similar features and colors.

@sauln
Member

sauln commented May 4, 2018

A lasso tool is a great idea. I've been working on a few updates to the visualize parts and will take a look at incorporating something like this.

I've been having trouble myself trying to extract the data of multiple nodes; going node by node can be tedious.
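
In the meantime, a small workaround sketch for the multi-node case (the node IDs are placeholders), just taking the union of the member row indices before exporting:

import numpy as np

def rows_for_nodes(graph, node_ids):
    # Union of member row indices across several nodes, deduplicated
    # because neighbouring nodes in the Mapper graph overlap.
    return np.unique(np.concatenate([graph["nodes"][n] for n in node_ids]))

selected = rows_for_nodes(graph, ["cube3_cluster0", "cube4_cluster0"])  # placeholder IDs
X_sel, y_sel = data[selected], labels[selected]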

Do you use mapper within Jupyter or open the html in a browser?

@totport

totport commented Mar 5, 2019

KeplerMapper is great! Definitely interested in having a lasso tool (or some other method of extracting multiple nodes) as part of the visualization.

Have there been any updates on this since last spring?

Thanks,

Jackson
