Bag of SFA Symbols (BOSS)
The Bag-of-SFA-Symbols (BOSS) [1] classifier is a dictionary-based classifier using the bag-of-words model. In a survey of Time Series Classification (TSC) algorithms [2], BOSS was found to be the best-performing dictionary-based algorithm at the time, and the third best of all algorithms tested.
The BOSS ensemble (referred to simply as BOSS) is made up of a number of individual BOSS classifiers, which classify new instances by majority vote. Each individual BOSS classifier transforms a time series into a number of discretised words by running a sliding window of size w over the series. A Discrete Fourier Transform (DFT) is applied to each window, and l letters from an alphabet of size a are extracted from the result using bin breakpoints found by Multiple Coefficient Binning (MCB). The words from each series are counted to form a histogram. A 1-Nearest Neighbour classifier using a bespoke BOSS distance then classifies new instances from these histograms. A grid search over the parameter range builds the candidate classifiers, and every individual BOSS classifier with a training accuracy within 92% of the best-performing one is kept for the ensemble.
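The transform described above can be sketched in a few functions. This is a minimal illustration and not the code of our implementation: all function names are hypothetical, the first Fourier coefficient is skipped on the assumption that windows are mean-normalised, MCB is read as simple equi-depth binning per coefficient, and the distance follows the non-symmetric form described in [1], where only words present in the query histogram contribute.

```python
import cmath
from collections import Counter


def dft_features(window, n_coeffs):
    """Real and imaginary parts of the leading Fourier coefficients.

    Assumption: the k = 0 (mean) coefficient is skipped.
    """
    n = len(window)
    feats = []
    k = 1
    while len(feats) < n_coeffs:
        c = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(window))
        feats.extend([c.real, c.imag])
        k += 1
    return feats[:n_coeffs]


def mcb_breakpoints(feature_rows, alphabet_size):
    """Equi-depth breakpoints for each feature column (a simple reading of MCB)."""
    breakpoints = []
    for column in zip(*feature_rows):
        ordered = sorted(column)
        step = len(ordered) / alphabet_size
        breakpoints.append([ordered[int(i * step)]
                            for i in range(1, alphabet_size)])
    return breakpoints


def to_word(features, breakpoints):
    """Map each feature to a letter index: the number of breakpoints it reaches."""
    return tuple(sum(value >= cut for cut in cuts)
                 for value, cuts in zip(features, breakpoints))


def bag_of_words(series, window_size, n_coeffs, breakpoints):
    """Slide a window over the series and count the resulting words."""
    bag = Counter()
    for start in range(len(series) - window_size + 1):
        window = series[start:start + window_size]
        bag[to_word(dft_features(window, n_coeffs), breakpoints)] += 1
    return bag


def boss_distance(query_bag, other_bag):
    """Squared difference summed over words that occur in the query only."""
    return sum((count - other_bag.get(word, 0)) ** 2
               for word, count in query_bag.items())
```

Because words missing from the query are ignored, `boss_distance(a, b)` and `boss_distance(b, a)` can differ, so the query series should always be passed first.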
A more stable and efficient scheme for building the BOSS ensemble, called RBOSS [3], is configurable in our implementation. RBOSS replaces the grid search with random parameter selection and introduces a new parameter k, the number of classifiers to be built. The max ensemble size parameter s is then used to filter out poorly performing classifiers: once the ensemble is full, the worst classifier is replaced whenever one with a higher train accuracy is built.
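The selection loop can be sketched as follows. This is an illustration under assumptions rather than the actual RBOSS code: `build_random_ensemble` and the `train_and_score` callback (which would build one classifier for a parameter set and return its train accuracy estimate) are hypothetical names, and parameters are drawn independently with replacement for simplicity.

```python
import random


def build_random_ensemble(train_and_score, parameter_space, k, s, seed=0):
    """Build k randomly parameterised classifiers, keeping at most the s best.

    train_and_score: hypothetical callback, params dict -> train accuracy.
    parameter_space: dict mapping a parameter name to its candidate values.
    """
    rng = random.Random(seed)
    ensemble = []  # list of (train_accuracy, params)
    for _ in range(k):
        params = {name: rng.choice(values)
                  for name, values in parameter_space.items()}
        accuracy = train_and_score(params)
        if len(ensemble) < s:
            ensemble.append((accuracy, params))
            continue
        # Ensemble is full: replace the worst member only if the new one is better.
        worst = min(range(len(ensemble)), key=lambda i: ensemble[i][0])
        if accuracy > ensemble[worst][0]:
            ensemble[worst] = (accuracy, params)
    return ensemble
```

A call such as `build_random_ensemble(score_fn, {"w": [10, 20, 30], "l": [8, 10, 12]}, k=250, s=50)` would return at most 50 `(accuracy, params)` pairs; the values shown here are placeholders, not recommended settings.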
In our recommended configuration of RBOSS, a randomly selected 70% subsample of the training data is used to build each individual classifier. Additionally, an accuracy-based exponential weighting scheme and a guided parameter selection are used. Overall these changes provide an order of magnitude speed-up without loss of accuracy. The changes also allow for more utility, such as the option to replace the ensemble size k with a time limit t through contracting, and the ability to save build progress using check-pointing.
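To make the subsampling and weighting concrete, here is a small sketch. The function names are hypothetical, and the exponent of 4 is an assumed value chosen for illustration; the text above only states that the weighting is exponential in train accuracy.

```python
import random
from collections import defaultdict


def subsample_indices(n_instances, proportion=0.7, rng=None):
    """Pick a random subset of train instance indices without replacement."""
    rng = rng or random.Random()
    size = max(1, round(n_instances * proportion))
    return sorted(rng.sample(range(n_instances), size))


def weighted_vote(predictions, train_accuracies, exponent=4):
    """Combine member predictions, weighting each vote by accuracy ** exponent.

    exponent=4 is an assumption for this sketch, not a documented default.
    """
    totals = defaultdict(float)
    for label, accuracy in zip(predictions, train_accuracies):
        totals[label] += accuracy ** exponent
    return max(totals, key=totals.get)
```

With predictions `["a", "b", "b"]` and train accuracies `[0.9, 0.6, 0.6]`, a plain majority vote picks `"b"`, while the weighted vote picks `"a"` because 0.9^4 exceeds 2 × 0.6^4. Raising accuracy to a power amplifies the influence of the strongest members.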
The BOSS class does not have any parameters that we recommend changing from the default. Utilities for outputting a train accuracy estimate and for using multiple threads are available.
The RBOSS class has a number of adjustable parameters that affect the classifier's output. By default, the recommended settings are used: the best-performing parameters from [3] with an improved parameter selection method. If you wish to configure the classifier differently, a list of adjustable parameters follows:
- randomCVAccEnsemble: sets the classifier to build k classifiers with randomly selected parameters, estimating train accuracy for each. Once the ensemble size reaches s, the lowest-accuracy classifier is replaced by any better-performing new one.
- randomEnsembleSelection: sets the classifier to build k classifiers with randomly selected parameters.
- ensembleSize: value for k, the number of classifiers to be built for the ensemble.
- maxEnsembleSize: value for s, the max number of classifiers for the ensemble.
- useWeights: whether to weight classifiers using an exponential scheme based on train accuracy.
- reduceTrainInstances: whether to build individual classifiers using less than the total train set. The reduction amount is set using trainProportion or maxTrainInstances.
- useFastTrainEstimate: whether to reduce the number of train instances used in estimating accuracy.
- bayesianParameterSelection: whether to estimate the accuracy of candidate parameter sets and select the one with the highest estimate for each new classifier, rather than selecting randomly.
- contractTime: replacement for k with value t, a time limit for building the classifier. The build is strongly inclined to stop below the limit rather than go over it.
- memoryLimit: memory limit for the classifier. Memory usage is estimated and building stops if the limit is expected to be exceeded.
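The idea behind contracting can be illustrated with a short loop. This is only a sketch of one way to prefer stopping early over running past the limit, not the strategy used by the implementation: `build_under_contract` and `build_one` are hypothetical names, and the stopping rule (stop when the longest build seen so far would no longer fit in the remaining time) is an assumption.

```python
import time


def build_under_contract(build_one, time_limit_seconds, clock=time.monotonic):
    """Keep building classifiers until another build is unlikely to fit the limit.

    build_one: hypothetical callback that builds and returns one classifier.
    clock: injectable time source, which makes the loop easy to test.
    """
    start = clock()
    built = []
    longest_build = 0.0
    while True:
        elapsed = clock() - start
        # Conservative check: assume the next build takes as long as the slowest so far.
        if elapsed + longest_build > time_limit_seconds:
            break
        before = clock()
        built.append(build_one())
        longest_build = max(longest_build, clock() - before)
    return built
```

Using the slowest build seen so far as the estimate makes the loop stop early rather than late, at the cost of sometimes leaving part of the time budget unused.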
Additionally, a number of utilities are available for the classifier. These include:
- trainCVPath: file path to write an estimate of classifier performance on the train set using cross validation.
- checkpointPath: file path to write check-point files, which let the classifier save its current progress and resume building from that point.
- numThreads: thread allowance for multithreading.
[1] P. Schäfer. The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery, 29(6):1505–1530, 2015.
[2] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660, 2017.
[3] M. Middlehurst, W. Vickers and A. Bagnall. Scalable Dictionary Classifiers for Time Series Classification. ArXiv e-prints, arXiv:1907.11815, 2019.