relate

Relate is an open-source python package to estimate similarities between strings (texts) using character- or token-based string similarity metrics. It takes a plain text file as input and returns a symmetric matrix of pairwise distances between the strings. This can be used for hierarchical clustering, phylogenetic or network analysis to investigate the relationships between different versions of a text.

The modules included in relate can be used together or separately.

The shingle module divides or tokenizes the texts into shingles of character or word length k, specified by a user. For more information about the shingling method, see here.

from relate import shingle
# Select the shingle length 
shingle.length = 2
# Select the shingle type: 'letters' or 'words'
shingle = shingle.select['letters']
text = 'the fox jumps' 
print(shingle(text))
['th', 'he', 'e ', ' f', 'fo', 'ox', 'x ', ' j', 'ju', 'um', 'mp', 'ps']

or

shingle = shingle.select['words']
print(shingle(text))
['the fox', 'fox jumps']

The similarity between pairs of strings is estimated by using character-based string metrics Levenshtein and Hamming similarity or token-based string metrics Jaccard similarity coefficient, Sorensen-Dice, and Overlap coefficient. Character-based metrics estimate the similarities directly from strings without using the shingling module. When applying the token-based methods, the shingling module is used, converting each string into a set and the shingles to elements of that set. For more information about the used string metrics, see here.

# An example of using the character-based metrics
from relate import metrics
text1 = 'the fox jumps'
text2 = 'the fox waits'
# Select preferable character-based metrics from two options: 'levenshtein', 'hamming'
measure = metrics.select['levenshtein']
print(measure(text1, text2))
69.23076923076923

or

# An example of using the token-based metrics
from relate import shingle, metrics
# Select the shingle length 
shingle.length = 2
# Select the shingle type: 'letters' or 'words'
metrics.shingle = shingle.select['letters']
# Select preferable token-based metrics from three options: 'jaccard', 'overlap', 'sorensen_dice'
measure = metrics.select['sorensen_dice']
print(measure(text1, text2))
58.333333

The matrixes module automatically arranges the estimated values into a symmetric matrix. All punctuation marks and capitals should be removed from the texts and the pronunciation standardized. The data should be arranged as following:

text1 = 'the fox jumps'
text2 = 'the fox waits'
text3 = 'one fox jumps'
text4 = 'second fox waits'
# Arrange all texts into two tuples:
texts1 = (text1,text2,text3,text4)
texts2 = (text1,text2,text3,text4)
# Names or IDs of the analyzed texts
matrixes.names = ('text1','text2','text3','text4')

The module returns a similarity matrix (values taken directly from the string metrics), a distance matrix (1-string metric) with or without standardizing function: estimated value - mean / standard deviation

from relate import shingle, metrics, matrixes
# Select the shingle length
shingle.length = 2
# Select the shingle type
metrics.shingle = shingle.select['letters']
# Select used similarity metrics
measure = metrics.select['sorensen_dice']
# Select the type of values used in the matrix: 'similarity', 'distance', 'st_similarity', 'st_distance'.
matrix = matrixes.select['distance']
result = matrix(measure,texts1, texts2)
print(result)
            text_1     text_2     text_3     text_4
text_1   0.000000  41.666667  16.666667  70.370370
text_2  41.666667   0.000000  58.333333  33.333333
text_3  16.666667  58.333333   0.000000  62.962963
text_4  70.370370  33.333333  62.962963   0.000000

Symmetric matrix organizes and visualizes the data in a clear manner. However, one may wish to manipulate the output. For instance, some require the upper half of the matrix only:

import numpy as np
print(np.triu(result))
[[ 0.       41.666667 16.666667 70.37037 ]
 [ 0.        0.       58.333333 33.333333]
 [ 0.        0.        0.       62.962963]
 [ 0.        0.        0.        0.      ]]

Or print the data as condensed matrix using scipy:

import scipy.spatial.distance as ssd
result = ssd.squareform(result)
print(result)
[41.666667 16.666667 70.37037  58.333333 33.333333 62.962963]

This, on the other hand, can be used as input in scipy hierarchical clustering algorithms:

from matplotlib import pyplot as plt
import scipy.cluster.hierarchy as sch

dendrogram = sch.dendrogram(sch.linkage(result, method='ward'), labels = names, leaf_font_size= 10, orientation = 'top',
                      color_threshold = 2, leaf_rotation=45)
plt.show()

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
Data		Data
.DS_Store		.DS_Store
README.md		README.md
matrixes.py		matrixes.py
metrics.py		metrics.py
shingle.py		shingle.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

relate

About

Releases

Packages

Languages

PasHyde/relate

Folders and files

Latest commit

History

Repository files navigation

relate

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages