Speaker Diarization Example
This is an overview of the speaker diarization application that uses the GMM specializer of PyCASP. The task of the application is to determine "who spoke when?" in an audio recording. The algorithm is based on agglomerative hierarchical clustering of GMMs using the Bayesian Information Criterion (BIC) to segment the audio feature files into speaker-homogeneous regions. Here we briefly describe the implementation in Python using the GMM specializer. For more details on the application, please see our ASRU'11 paper.
The diarization script is examples/diarizer/cluster.py. After reading the config file (see below), the __main__ function creates a Diarizer object, which then creates an initial list of GMMs used for clustering. It then calls cluster() to perform the main clustering computation. The algorithm is outlined as follows:
- Initialization: Train a set of GMMs, one per initial segment, using the expectation-maximization (EM) algorithm.
- Re-segmentation: Re-segment the audio track using a majority vote over the GMMs’ likelihoods within each 2.5 s window.
- Re-training: Retrain the GMMs on the new segmentation.
- Agglomeration: Select the most similar GMMs and merge them. At each iteration, the algorithm checks all possible pairs of GMMs, looking to obtain an improvement in BIC scores by merging the pair and retraining it on the pair’s combined audio segments. The GMM clusters of the pair with the largest improvement in BIC scores are permanently merged. The algorithm then repeats from the re-segmentation step until there are no remaining pairs whose merging would lead to an improved BIC score.
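The control flow of the re-segmentation and agglomeration steps can be sketched as follows. This is a minimal illustration, not the code in cluster.py: majority_vote and pick_best_merge are hypothetical helper names, frame_winners is assumed to hold the index of the most likely GMM for each frame, and merge_gain is a stand-in for whatever BIC-improvement computation the script actually performs.

```python
import itertools
from collections import Counter


def majority_vote(frame_winners, seg_length):
    """Relabel each block of seg_length frames with its most frequent winner.

    frame_winners[t] is assumed to be the index of the GMM with the highest
    likelihood for frame t (a hypothetical input, not the script's layout).
    """
    labels = []
    for start in range(0, len(frame_winners), seg_length):
        block = frame_winners[start:start + seg_length]
        winner = Counter(block).most_common(1)[0][0]
        labels.extend([winner] * len(block))
    return labels


def pick_best_merge(num_models, merge_gain):
    """Return the pair (i, j) with the largest positive gain, or None.

    merge_gain(i, j) is a caller-supplied stand-in for the BIC improvement
    obtained by merging models i and j and retraining on their joint data.
    """
    best_pair, best_gain = None, 0.0
    for i, j in itertools.combinations(range(num_models), 2):
        gain = merge_gain(i, j)
        if gain > best_gain:
            best_pair, best_gain = (i, j), gain
    return best_pair
```

A return value of None from pick_best_merge corresponds to the stopping condition: no remaining pair improves the score, so the loop over re-segmentation, re-training, and agglomeration ends.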
The script can either use a KL-divergence-based approximation to choose the GMM pairs to merge, or compare all pairs of GMMs (see paper). This setting can be specified in the config file (see below).
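The idea of shortlisting candidate pairs before the more expensive BIC comparison can be illustrated with a simplified sketch. The KL divergence between two GMMs has no closed form, so this example summarizes each cluster by a single diagonal-covariance Gaussian, which is a simplification chosen for illustration and not the approximation described in the paper. The function names and the (mean, variance) cluster representation are assumptions.

```python
import itertools
import math


def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total


def shortlist_pairs(clusters, ntop):
    """Return the ntop pairs of clusters with the smallest symmetric KL.

    clusters is a list of (mean, variance) tuples, one per cluster
    (hypothetical representation, used here only to keep the sketch small).
    """
    scored = []
    for i, j in itertools.combinations(range(len(clusters)), 2):
        (mean_i, var_i), (mean_j, var_j) = clusters[i], clusters[j]
        distance = (kl_diag_gauss(mean_i, var_i, mean_j, var_j)
                    + kl_diag_gauss(mean_j, var_j, mean_i, var_i))
        scored.append((distance, (i, j)))
    scored.sort()
    return [pair for _, pair in scored[:ntop]]
```

Only the shortlisted pairs would then be merged, retrained, and scored with BIC, which reduces the number of retraining runs per iteration from all pairs to ntop.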
Finally, the script outputs two types of files: the segmentation result (in the NIST RTTM format) and the final parameters of the trained GMMs.
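To give a feel for the RTTM output, here is a minimal sketch that turns per-frame cluster labels into SPEAKER records using the usual ten-field NIST RTTM layout. The function name, the speaker_N naming, the channel value 1, and the 10 ms frame step are assumptions made for illustration; the exact lines written by cluster.py may differ.

```python
from itertools import groupby


def frames_to_rttm(labels, file_id, frame_seconds=0.01):
    """Convert per-frame labels into RTTM SPEAKER lines.

    labels[t] is the cluster index for frame t, or None for non-speech.
    frame_seconds is an assumed frame step (10 ms), not a documented value.
    """
    lines = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label is not None:  # non-speech runs produce no record
            onset = position * frame_seconds
            duration = length * frame_seconds
            lines.append(
                "SPEAKER {} 1 {:.2f} {:.2f} <NA> <NA> speaker_{} <NA> <NA>".format(
                    file_id, onset, duration, label))
        position += length
    return lines
```

For example, frames_to_rttm([0, 0, 0, None, 1, 1], "IS1000a") yields one record for speaker_0 starting at 0.00 s and one for speaker_1 starting at 0.04 s.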
To run the script, call python examples/diarizer/cluster.py as you would any regular Python script.
Using the diarization config file
The script takes a config file that sets all the parameters for diarization. The default config file name is diarizer.cfg. You can also pass your own config file with the -c option: python examples/diarizer/cluster.py -c my_config.cfg. The script uses the Python ConfigParser library, so the parameters in the config file must go under the [Diarizer] section tag. To display the config file settings, use the --help option when running the script: python examples/diarizer/cluster.py --help.
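The option handling and section lookup described above can be sketched roughly as follows. This is not the code in cluster.py: it uses the Python 3 configparser module (the successor of ConfigParser), and the function names parse_args, read_diarizer_section, and main are hypothetical. One detail worth knowing is that configparser lowercases option names by default, so the sketch overrides optionxform to keep names such as M_mfcc and KL_ntop intact.

```python
import argparse
import configparser


def parse_args(argv=None):
    """Parse the command line; -c selects the config file."""
    parser = argparse.ArgumentParser(description="Speaker diarization (sketch)")
    parser.add_argument("-c", dest="config", default="diarizer.cfg",
                        help="path to the diarization config file")
    return parser.parse_args(argv)


def read_diarizer_section(config_text):
    """Return the [Diarizer] section of a config file as a dict of strings."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names case-sensitive
    config.read_string(config_text)
    if not config.has_section("Diarizer"):
        raise ValueError("config file has no [Diarizer] section")
    return dict(config.items("Diarizer"))


def main(argv=None):
    """Read the selected config file and return its raw parameters."""
    args = parse_args(argv)
    with open(args.config) as handle:
        return read_diarizer_section(handle.read())
```

All values come back as strings at this stage; converting them to integers and applying defaults is a separate step.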
Here's an example diarizer.cfg file for a sample AMI meeting:
[Diarizer]
basename = IS1000a
mfcc_feats = /AMI/featuresIS1000a_seg.feat.htk
spnsp_file = /AMI/spnsp/IS1000a_seg.spch
output_cluster = IS1000a.rttm
gmm_output = IS1000a.gmm
em_iterations = 3
initial_clusters = 16
M_mfcc = 5
KL_ntop = 3
num_seg_iters_init = 1
num_seg_iters = 1
seg_length = 250
Some parameters are required; others are optional and fall back to default values where noted:
Required parameters
- basename: meeting base name
- mfcc_feats: HTK feature file for the audio recording
- output_cluster: name of the output RTTM file
- gmm_output: name of the GMM parameters file
- initial_clusters: number of initial clusters
- M_mfcc: number of Gaussians per model
Optional parameters
- em_iterations: number of EM iterations for training (3 by default)
- spnsp_file: speech/nonspeech file
- KL_ntop: number of GMM pairs to evaluate BIC on (0 to disable the KL-divergence approximation)
- num_seg_iters_init: number of majority vote segmentation iterations for the initial phase (2 by default)
- num_seg_iters: number of majority vote segmentation iterations for the main clustering loop (3 by default)
- seg_length: segment length for majority vote (250 by default)
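The required/optional split above can be enforced with a small validation step. The sketch below assumes the parameters arrive as a dict of raw strings, as a ConfigParser section would provide them; resolve_params and the two constants are hypothetical names. Since no default is stated for KL_ntop or spnsp_file, the sketch leaves them as None when absent rather than guessing a value.

```python
REQUIRED = ("basename", "mfcc_feats", "output_cluster", "gmm_output",
            "initial_clusters", "M_mfcc")

# Defaults taken from the parameter list above.
OPTIONAL_INT_DEFAULTS = {
    "em_iterations": 3,
    "num_seg_iters_init": 2,
    "num_seg_iters": 3,
    "seg_length": 250,
}


def resolve_params(raw):
    """Check required parameters, convert numbers, and fill in defaults."""
    missing = [name for name in REQUIRED if name not in raw]
    if missing:
        raise KeyError("missing required parameters: " + ", ".join(missing))

    params = {name: raw[name] for name in REQUIRED}
    params["initial_clusters"] = int(params["initial_clusters"])
    params["M_mfcc"] = int(params["M_mfcc"])

    for name, default in OPTIONAL_INT_DEFAULTS.items():
        params[name] = int(raw.get(name, default))

    # No default is documented for these two, so absence is kept explicit.
    params["spnsp_file"] = raw.get("spnsp_file")
    params["KL_ntop"] = int(raw["KL_ntop"]) if "KL_ntop" in raw else None
    return params
```

Failing early on a missing required parameter gives a clearer error than a lookup failure partway through clustering.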