-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathngram_pattern_search.py
265 lines (199 loc) · 12.9 KB
/
ngram_pattern_search.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
"""
ngram_pattern_search.py contains NgramSimilarity class, which reads the output of setup_ngrams_tfidf.main(), a table
containing:
[rows] All unique n-gram patterns for a given musical feature and range of pattern lengths,
which occur at least once in a given feature sequence format music corpus.
[cols]: tf-idf values for each piece of music in the corpus; plus a single corpus-level idf column.
EG: All accent-level 'pitch_class' feature patterns of 5-10 note-events in length, which occur at least once, extracted
from the MIDI Ceol Rince na hEireann corpus at https://github.com/danDiamo/music_pattern_analysis/tree/master/corpus.
Such a table allows us read tf-idf value per n-gram per piece of music, and, from this input data,
NgramSimilarity can perform the following tasks:
-Extract most frequent n-gram(s) (as ranked by tf-idf) from a single piece of music.
-Find all similar patterns across the corpus via Damerau-Levenshtein distance.
-[Work-in-progress] Identify pieces of music which contain the highest number of similar patterns to the candidate.
"""
from fastDamerauLevenshtein import damerauLevenshtein
import pandas as pd
pd.options.mode.chained_assignment = None
class NgramSimilarity:
"""
An NgramSimilarity object is instantiated by a single argument, 'inpath', which is the path to a data table of
n-gram patterns, tf-idf, and idf values, in feather format.
Attributes:
input_corpus -- input data file read to pandas Dataframe.
title -- title of search candidate tune under investigation.
Assigned in main() and passed to NgramSimilarity.extract_candidate_ngrams().
candidate_ngrams -- empty attribute, to hold a dict of the most frequent ngram pattern(s) of length n for a
single candidate piece of music. Attr is populated via NgramSimilarity.extract_candidate_ngrams() method.
n -- length of candidate pattern(s). Assigned in main() and passed to NgramSimilarity.extract_candidate_ngrams()
test_corpus -- empty attribute, to hold a filtered version of input_corpus Dataframe, containing only data for
n-gram patterns with length equal to length of candidate pattern ±1. Attr is populated via
NgramSimilarity.setup_test_corpus() method.
ngram_similarity_results -- empty attribute, to hold a filtered version of test_corpus Dataframe, retaining
similar n-gram patterns to the candidate pattern(s)as measured via Damerau-Levenshtein edit
distance. Attr is populated via NgramSimilarity.find_similar_patterns() method.
results -- [work-in-progress] A Dataframe containing two columns, output of
NgramSimilarity.find_similar_tunes() method:
[1] 'title: titles of all pieces of music in the corpus which contain at least one of
the similar patterns held in in NgramSimilarity.ngram_similarity_results.
[2] 'count': the number of similar n-gram patterns which occur in the tune.
Current work-in-progress involves evaluating the viability of this table, sorted by count, as a simple
indicator of melodic similarity. Formal evaluation and additional metrics will be added as work continues.
"""
def __init__(self, inpath):
raw_input = pd.read_feather(inpath)
# convert n-gram arrays to tuples
raw_input['ngram'] = [tuple(ngram) for ngram in raw_input['ngram']]
self.input_corpus = raw_input
self.title = None
self.candidate_ngrams = None
self.n = None
self.test_corpus = None
self.ngram_similarity_results = None
self.results = None
def extract_candidate_ngrams(self, title, n=None, mode=None, indices=None):
# TODO: Add exceptions for if blocks.
"""
Extracts frequent n-gram pattern(s) of length 'n' items for a target piece of music.
Stores results in self.candidate_ngrams dict, formatted per: tune_title [key]: candidate_ngrams [value].
Args:
title -- title of the candidate tune under investigation. This is equivalent to the filename of the feature
sequence csv representation of the tune, with the '.csv' suffix removed.
n -- length of candidate pattern(s).
EG: n = 6: extracts the most frequent pattern(s) of 6 items in length from the piece of music named in 'title'.
mode -- either 'tfidf' or 'idx'.
If mode=='max_tfidf' the method will return all ngram pattern(s) in the target piece of music with the maximum
tf-ifd value.
If mode=='idx' the method returns n-gram patterns for a specific range of indices, as ranked by tf-idf.
The index values are passed as a list to 'indices' parameter.
indices -- A list of integer column index values. EG: for 'indices'=[0, 1], the top two n-gram patterns as
ranked by tf-idf will be extracted..
NOTE: This method will be augmented with n-gram pattern clustering functionality in future work.
"""
self.n, self.title = n, title
for col_name in self.input_corpus.columns:
if self.title in col_name:
target_col = col_name
# retain only rows for n-grams of length n:
filtered_corpus = self.input_corpus[self.input_corpus['ngram'].apply(lambda x: len(x) == self.n)]
# extract n-gram(s) of max tf-idf value from column in corpus Dataframe corresponding to search candidate tune:
if mode == 'max_tfidf':
freq_ngrams = filtered_corpus[['ngram', target_col]][filtered_corpus[target_col] ==
filtered_corpus[target_col].max()]
elif mode == 'idx':
filtered_corpus.sort_values(by=[target_col], ascending=False, inplace=True)
filtered_corpus.reset_index(inplace=True, drop=True)
print(filtered_corpus.head())
freq_ngrams = filtered_corpus[filtered_corpus.index.isin(indices)]
self.candidate_ngrams = freq_ngrams
print(f"\nFrequent n-gram pattern(s) extracted from {title}:")
print(self.candidate_ngrams)
return self.candidate_ngrams
def setup_test_corpus(self):
"""
This method filters an NgramCorpus.corpus Dataframe, retaining only n-gram patterns of length n±1.
This is to allow the Damerau-Levenshtein edit distance algorithmsdetect similar patterns even if they are of
slightly different length to the candidate.
EG: When we look for similar patterns for a given pitch class 6-gram, we extract all 5, 6, and 7-grams using
this method, and then search them via NgramSimilarity.find_similar_patterns().
"""
print("\nFiltering corpus n-grams...")
self.test_corpus = self.input_corpus[self.input_corpus['ngram'].apply(lambda x: self.n-1 <= len(x) <= self.n+1)]
print("Corpus n-gram filtering complete.")
return self.test_corpus
def find_similar_patterns(self, edit_dist_threshold=None):
"""
This method runs Robert Grigoriou's fastDamerauLevenshtein implementation of the Damerau Levenshtein edit
distance algorithm, with default weights (a penalty value of 1 for replacement, deletion, swapping or insertion
of items in sequence)
Pairwise similarity is calculated for each of the candidate n-gram(s) held in
NgramSimilarity.candidate_ngrams dict vs each of the n-grams in NgramSimilarity.test_corpus Dataframe.
The results are filtered according to an integer value, as passed to 'edit_dist_threshold' arg:
only n-grams with an edit distance less than or equal to the threshold value are retained, and they are returned
to NgramSimilarity.ngram_similarity_results Dataframe.
"""
print("\nSearching corpus for similar n-gram patterns...")
candidates = [candidate for candidate in self.candidate_ngrams['ngram']]
distances = self.test_corpus[['ngram']].copy()
# Calculate Damerau-Levenshtein distances:
for candidate in candidates:
edit_dists = [damerauLevenshtein(candidate, ngram, similarity=False) for ngram in distances['ngram']]
distances[f'{candidate}'] = edit_dists
# Print report:
distances.set_index('ngram', inplace=True)
# Filter out rows which do not have any values <= threshold:
retained = distances[(distances <= edit_dist_threshold).any(1)]
retained.reset_index(inplace=True)
# Print report & return results:
print(f"{len(retained)} Similar patterns detected:")
print(retained.head())
self.ngram_similarity_results = retained
return self.ngram_similarity_results
def find_similar_tunes(self):
"""
This method filters NgramSimilarity.test_corpus Dataframe, firstly retaining only rows for n-grams matching
those in NgramSimilarity.ngram_similarity_results. It then filters the Dataframe columns, retaining only those
containing any non-zero values.
After filtering, only pieces of music (i.e.: columns) which themselves contain at least one of the similar
patterns held in NgramSimilarity.ngram_similarity_results are retained.
The number of similar patterns contained by each piece of music is counted, and the results are sorted by count.
Work-in-progress involves evaluation of the efficacy of this count as an indicator of similarity with the
candidate piece of music.
"""
print("Searching corpus for similar tunes...")
# filter NgramSimilarity.test_corpus, retaining only rows for similar n-grams:
lookup_df = self.test_corpus[self.test_corpus['ngram'].isin(self.ngram_similarity_results['ngram'])]
lookup_df.drop(['ngram', 'idf'], axis=1, inplace=True)
# drop all zero-value columns (i.e.: all tunes in which no similar patterns occur):
lookup_df = lookup_df.loc[:, (lookup_df != 0).any(axis=0)]
# count non-zero values in all remaining columns (i.e.: for all tunes with at least one similar pattern,
# return the number of similar n-gram patterns which occur in the tune).
# NOTE: The frequency of occurrences of each similar n-gram per tune is not yet included in calculations.
# This will be the first addition for the second draft of deliverable D3.1.
count = lookup_df.astype(bool).sum(axis=0)
results = count.to_frame().reset_index()
results.columns = ['title', 'count']
results['title'] = results['title'].str.slice(0, -7)
results.sort_values(by=['count'], ascending=False, inplace=True)
print(f"Similarity results for {self.title}:")
print(results.head())
results.to_csv(f"/Users/dannydiamond/NUIG/Polifonia/CRE_clean/testing/{self.title}.csv")
self.results = results
def main():
"""
Initializes NgramSimilarity object by passing the path to a data table of n-gram patterns, tf-idf, and idf values,
in feather format, as outputted by setup_ngrams_tfidf.main().
With an NgramSimilarity object set up, we call:
NgramSimilarity.extract_candidate_ngrams()
Extracts n-gram patterns from candidate tune under investigation.
Args:
'title' -- title of candidate piece of music per the filename of the original MIDI file.
'n' -- pattern length.
'mode' -- If 'tfidf', this method call will extract the most frequent n-gram pattern of length n from the
candidate tune, as ranked by tf-idf. If 'idx', the cal will extract the n-grams at indices listed in
'indices' arg.
indices -- list of integers corresponding to NgramSimilarity.test_corpus dataframe indices.
NgramSimilarity.setup_test_corpus()
Filters corpus Dataframe to include patterns of length n±1.
NgramSimilarity.find_similar_patterns()
Finds similar patterns to candidate pattern via Damerau-Levenshtein edit distance.
arg: 'edit_dist_threshold' = maximum number of edits permitted in Damerau-Levenshtein algorithm.
NOTE: default edit_dist_threshold=2, but for pieces of music with multiple frequent patterns, reducing this value to
1 can prove more effective.
NgramSimilarity.find_similar_tunes()
Filters the corpus for pieces of music which include similar patterns to candidate.
Counts occurrences of similar patterns per piece of music and returns ranked table.
Saves table to csv.
:returns: NgramSimilarity.results
This is an initial work-in-progress metric of similarity between the search candidate and other pieces of music in
the corpus. Evaluation and work on additional metrics is ongoing.
"""
# TODO: Add CLI to allow modification of args.
inpath = "/Users/dannydiamond/NUIG/Polifonia/CRE_clean/ngrams/cre_pitch_class_accents_tfidf.ftr"
pattern_search = NgramSimilarity(inpath)
pattern_search.extract_candidate_ngrams("Lord McDonald's (reel)", n=6, mode='idx', indices=[0, 1])
pattern_search.setup_test_corpus()
pattern_search.find_similar_patterns(edit_dist_threshold=1)
pattern_search.find_similar_tunes()
return pattern_search.results
main()