Skip to content

Fast but accurate approximation of Ward's agglomerative clustering using a fully connected TSP graph

License

Notifications You must be signed in to change notification settings

uef-machine-learning/tspgclu

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TSPg

The TSPg software implements the following algorithm that approximates Ward's agglomerative clustering:

Sieranoja, S., Fränti, P. Fast agglomerative clustering using approximate traveling salesman solutions. Journal of Big Data 12, 21 (2025). https://doi.org/10.1186/s40537-024-01053-x

The software is provided with GNU Lesser General Public License, Version 3. https://www.gnu.org/licenses/lgpl.html. All files except those under data and contrib -folders are subject to this license. See LICENSE.txt.

Contact: samisi@cs.uef.fi

Please, let me know how the software works for you if you try it out.

Python interface

Install & test

git clone https://github.com/uef-machine-learning/tspgclu.git
cd tspgclu
pip install -r requirements.txt
pip install .
python python/ex_cluster.py

Examples

See ex_* files in directory python/

Ward's agglomerative clustering using the TSP graph

(see file python/ex_cluster.py)

import numpy as np
import matplotlib.pyplot as plt

import tspg

def show_clusters_2d(x,labels,numclu):
	colormap = plt.cm.gist_ncar
	colorst = [colormap(i) for i in np.linspace(0, 0.9,numclu)]
	# print(colorst)
	u_labels = np.unique(labels)
	for i in u_labels:
		plt.scatter(x[labels == i , 0] , x[labels == i , 1] , label = i, color = colorst[i-1])
	plt.show()

# Fast version using built in distance functions written in C++:
def example_vec(ds,numclu):
	# For higher quality:
	#  - increase number of tsp paths (num_tsp), (in range [2,100])
	# Needs ds input in python list format
	labels,mergeOrder = tspg.tspg(ds.tolist(),numclu,distance="l2",num_tsp=5,dtype="vec")
	
	show_clusters_2d(ds,labels,numclu)

x=np.loadtxt('data/s1.txt')
example_vec(x,15)

clustering results

Showing the dendogram

See python/ex_dendogram.py

...
ds = np.genfromtxt('data/s1_small.txt')
labels,mergeOrder = tspg.tspg(ds.tolist(),1,distance="l2",num_tsp=5,dtype="vec")

mergeOrder_scipy = mergeOrderToScipyFormat(mergeOrder)

fig, axs = plt.subplots(1, 2, figsize=(12, 6), gridspec_kw={'width_ratios': [1, 1]})

plt1 = axs[0]; plt2 = axs[1]

# Create 2d plot showing the  merges
plt1.scatter(ds[:, 0], ds[:, 1], marker='o', color='b')
for pair in mergeOrder:
	plt1.plot([ds[pair[0],0], ds[pair[1],0]] , [ds[pair[0],1], ds[pair[1],1]], 'k-')
plt1.set_title('Merge Order')

# Plot the dendrogram
dendrogram(mergeOrder_scipy,ax=plt2,no_labels=True)
plt2.set_title('Hierarchical Clustering Dendrogram')
plt.show()

dendogram

Getting the TSP graph

(file python/ex_create_graph.py)

import numpy as np
import matplotlib.pyplot as plt
import tspg
x=np.loadtxt('data/s1_small.txt')

# the graph is represented as num_tsp different linear orderings between the data x
paths = tspg.create_graph(x.tolist(),distance="l2",num_tsp=4)

plt.figure(figsize=(6, 6))
plt.scatter(x[:, 0], x[:, 1], marker='o', color='b')

for tsp in paths:
	for i in range(0,len(tsp)-1):
		# Draw line betwen consequtive points according the order in tsp  
		plt.plot([x[tsp[i],0], x[tsp[i+1],0]] , [x[tsp[i],1], x[tsp[i+1],1]], 'k-')

plt.title("The TSP graph")
plt.show()

tsp graph

Commandline interface

Compile

make clean
make

Use cases

Numerical (vectorial) data

Cluster vectorial (numerical) S1 dataset (data/s1.txt) to 15 clusters:

./tspg data/s1.txt --type vec -T 10  -C 15 --algo tspgclu -o tmp/s1part.txt  --cfn tmp/centroids.txt

For string data

./tspg  data/birkbeckU.txt --type txt -T 10 -C 20 --algo tspgclu --dfun lev -o tmp/birkbeckU_part.txt -H
./show_text_clusters.rb data/birkbeckU.txt tmp/birkbeckU_part.txt > tmp/text_clu.html

About

Fast but accurate approximation of Ward's agglomerative clustering using a fully connected TSP graph

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published