
Speaker Diarization Example

Brian Cunnie edited this page Mar 5, 2016 · 3 revisions

Speaker Diarization

This is an overview of the speaker diarization application that uses the GMM specializer of PyCASP. The task of the application is to determine "who spoke when?" in an audio recording. The algorithm is based on agglomerative hierarchical clustering of GMMs, using the Bayesian Information Criterion (BIC) to segment the audio feature files into speaker-homogeneous regions. Here we briefly describe the implementation in Python using the GMM specializer. For more details on the application, please see our ASRU'11 paper.

The script for diarization is in examples/diarizer/cluster.py. After reading the config file (see below), the __main__ function creates a Diarizer object, which then creates an initial list of GMMs used for clustering. It then calls cluster() to perform the main clustering computation. The algorithm is outlined as follows:

  1. Initialization: Train a set of GMMs, one per initial segment, using the expectation-maximization (EM) algorithm.
  2. Re-segmentation: Re-segment the audio track by majority vote over the GMMs’ likelihoods within windows of 2.5 s.
  3. Re-training: Retrain the GMMs on the new segmentation.
  4. Agglomeration: Select the most similar GMMs and merge them. At each iteration, the algorithm checks all possible pairs of GMMs, looking to obtain an improvement in BIC scores by merging the pair and retraining it on the pair’s combined audio segments. The GMM clusters of the pair with the largest improvement in BIC scores are permanently merged. The algorithm then repeats from the re-segmentation step until there are no remaining pairs whose merging would lead to an improved BIC score.
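The majority-vote re-segmentation in step 2 can be sketched in plain Python as follows. This is only an illustration, not the code in cluster.py: the function name `majority_vote_resegment` and the nested-list layout of `frame_loglikes` are assumptions made for the example, and the per-frame log-likelihoods are taken as already computed by the GMMs.

```python
from collections import Counter


def majority_vote_resegment(frame_loglikes, seg_length=250):
    """Assign every frame in each fixed-length window to the winning GMM.

    frame_loglikes[m][t] is the log-likelihood of frame t under model m
    (hypothetical layout chosen for this sketch).
    seg_length is the window size in frames.
    Returns a list with one model index per frame.
    """
    num_models = len(frame_loglikes)
    num_frames = len(frame_loglikes[0])
    labels = []
    for start in range(0, num_frames, seg_length):
        end = min(start + seg_length, num_frames)
        # Each frame votes for the model that gives it the highest likelihood.
        votes = Counter(
            max(range(num_models), key=lambda m: frame_loglikes[m][t])
            for t in range(start, end)
        )
        # The model with the most votes labels the whole window.
        winner = votes.most_common(1)[0][0]
        labels.extend([winner] * (end - start))
    return labels
```

If feature frames are 10 ms apart (an assumption; the frame rate is not stated here), a window of 250 frames corresponds to the 2.5 s mentioned in step 2.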

When choosing which GMM pairs to merge, the script can either use a KL-divergence-based approximation to shortlist candidate pairs or compare all pairs of GMMs (see paper). This setting is specified in the config file (see below).
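A minimal sketch of the shortlisting idea: rank all pairs by a cheap divergence and keep only the closest few for the expensive BIC comparison. The names `candidate_pairs` and `symmetric_kl_1d` are hypothetical, and the closed-form KL divergence between single 1-D Gaussians merely stands in for the GMM approximation described in the paper, which may differ.

```python
import math
from itertools import combinations


def symmetric_kl_1d(p, q):
    """Symmetric KL divergence between two 1-D Gaussians given as (mean, variance).

    Stand-in for a GMM-level approximation; used here for illustration only.
    """
    (mu_p, var_p), (mu_q, var_q) = p, q
    sq_diff = (mu_p - mu_q) ** 2
    kl_pq = 0.5 * (math.log(var_q / var_p) + (var_p + sq_diff) / var_q - 1.0)
    kl_qp = 0.5 * (math.log(var_p / var_q) + (var_q + sq_diff) / var_p - 1.0)
    return kl_pq + kl_qp


def candidate_pairs(models, kl_ntop, divergence=symmetric_kl_1d):
    """Return the index pairs on which BIC should be evaluated.

    kl_ntop <= 0 disables the shortlist, so every pair is returned.
    Otherwise only the kl_ntop pairs with the smallest divergence are kept.
    """
    all_pairs = list(combinations(range(len(models)), 2))
    if kl_ntop <= 0:
        return all_pairs
    ranked = sorted(
        all_pairs, key=lambda ij: divergence(models[ij[0]], models[ij[1]])
    )
    return ranked[:kl_ntop]
```

With N models there are N(N-1)/2 pairs, so a small shortlist bounds the number of merge-and-retrain trials per iteration regardless of how many clusters remain.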

Finally, the script outputs two types of files: the segmentation result (in the NIST RTTM format) and the final parameters of the trained GMMs.
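For orientation, here is a rough sketch of how per-frame cluster labels could be turned into RTTM SPEAKER lines. It assumes the commonly used ten-field SPEAKER layout, a 10 ms frame step, and `spk<N>` speaker names; the function `labels_to_rttm` is hypothetical, and the exact lines written by cluster.py may differ.

```python
from itertools import groupby


def labels_to_rttm(labels, basename, frame_seconds=0.01):
    """Convert per-frame cluster labels into RTTM SPEAKER lines.

    labels: one cluster index per frame, or None for non-speech frames.
    basename: meeting name written in the file field.
    frame_seconds: assumed duration of one frame (10 ms here).
    """
    lines = []
    start = 0
    # Consecutive frames with the same label form one speaker turn.
    for label, run in groupby(labels):
        length = len(list(run))
        if label is not None:
            lines.append(
                "SPEAKER %s 1 %.2f %.2f <NA> <NA> spk%d <NA> <NA>"
                % (basename, start * frame_seconds, length * frame_seconds, label)
            )
        start += length
    return lines
```

Each line carries the turn onset and duration in seconds, so downstream scoring tools can compare the output against a reference segmentation.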

To run the script, invoke it like any other Python script: python examples/diarizer/cluster.py.

Using the diarization config file

The script takes a config file that sets all the parameters for diarization. The default config file name is diarizer.cfg. You can also pass your own config file with the -c option: python examples/diarizer/cluster.py -c my_config.cfg. We use the Python ConfigParser library, so the script requires the parameters in the config file to go under the [Diarizer] section tag. To display the config file settings, use the --help option when running the script: python examples/diarizer/cluster.py --help.

Here's an example diarizer.cfg file on a sample AMI meeting:

      [Diarizer]
      basename = IS1000a
      mfcc_feats = /AMI/featuresIS1000a_seg.feat.htk
      spnsp_file = /AMI/spnsp/IS1000a_seg.spch
      output_cluster = IS1000a.rttm
      gmm_output = IS1000a.gmm

      em_iterations = 3
      initial_clusters = 16
      M_mfcc = 5

      KL_ntop = 3
      num_seg_iters_init = 1
      num_seg_iters = 1
      seg_length = 250

Some of the parameters are required and some are optional (and have some default values):

Required parameters

  • basename: meeting base name
  • mfcc_feats: HTK feature file for the audio recording
  • output_cluster: name of the output RTTM file
  • gmm_output: name of the GMMs parameters file
  • initial_clusters: number of initial clusters
  • M_mfcc: number of Gaussians per model

Optional parameters

  • em_iterations: number of EM iterations for training (3 by default)
  • spnsp_file: Speech/nonspeech file
  • KL_ntop: number of GMM pairs to evaluate BIC on (0 disables the KL-divergence approximation, so all pairs are compared)
  • num_seg_iters_init: number of majority vote segmentation iterations for the initial phase (2 by default)
  • num_seg_iters: number of majority vote segmentation iterations for the main clustering loop (3 by default)
  • seg_length: segment length for majority vote (250 by default)
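The required and optional parameters above could be loaded along these lines. This is a sketch using Python 3's `configparser` (the successor of the ConfigParser module), not the parsing code in cluster.py; `load_diarizer_config`, `REQUIRED`, and `OPTIONAL_DEFAULTS` are names made up for the example. Only defaults stated in the list above are filled in, so spnsp_file and KL_ntop are left unset when absent.

```python
import configparser

# Parameter names taken from the lists above.
REQUIRED = [
    "basename", "mfcc_feats", "output_cluster",
    "gmm_output", "initial_clusters", "M_mfcc",
]
OPTIONAL_DEFAULTS = {
    "em_iterations": "3",
    "num_seg_iters_init": "2",
    "num_seg_iters": "3",
    "seg_length": "250",
}


def load_diarizer_config(text):
    """Parse config text and return the [Diarizer] settings as a dict of strings."""
    parser = configparser.ConfigParser()
    # Keep key case intact; by default keys such as M_mfcc would be lowercased.
    parser.optionxform = str
    parser.read_string(text)
    section = parser["Diarizer"]
    missing = [key for key in REQUIRED if key not in section]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    # Start from the documented defaults, then apply the file's values.
    settings = dict(OPTIONAL_DEFAULTS)
    settings.update(dict(section.items()))
    return settings
```

Values are returned as strings; a caller would convert numeric ones, for example `int(settings["seg_length"])`.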