Binseg incorrect for small data, l2 loss and min_size=1 #242

tdhock · 2022-03-23T17:16:25Z

Hi @deepcharles thanks for putting your implementation of binary segmentation online. I am trying to use it to compute the full path of binary segmentation models, but the predict method is giving me an exception for large numbers of breakpoints, and incorrect changepoints for small numbers of breakpoints. Here is a small data example:

import ruptures as rpt
import numpy as np
rpt.version.version
data_list = [0.0, 0.1, 1.2, 1.0]
N_data = len(data_list)
data_mat = np.array(data_list).reshape(N_data,1)
algo = rpt.Binseg(model="l2",min_size=1,jump=1).fit(data_mat)
computed_break_dict={}
for n_bkps in range(N_data):
    try:
        result = algo.predict(n_bkps=n_bkps)
    except Exception as e:
        result = e
    computed_break_dict[n_bkps] = result
computed_break_dict
expected_break_dict = {
    0:[4],
    1:[2,4],
    2:[2,3,4],
    3:[1,2,3,4]
    }

In this case there are four data points. I expected the first change to be after the second data point, etc (as shown in expected_break_dict above). Instead I got the results in computed_break_dict, which are not expected:

I expected the result for n_bkps=1 to be a list of two segment ends, but I only got one.
I expected the results for n_bkps=2 or 3 to be a list of segment ends, but I got BadSegmentationParameters exceptions.
Am I using Binseg correctly, and is this a bug? If I am not using it correctly, can you please explain the correct way?
Thanks in advance!
Here is the output I got from running the code above on my system:

Python 3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:11:50) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import ruptures as rpt
>>> import numpy as np
>>> rpt.version.version
'1.1.6'
>>> data_list = [0.0, 0.1, 1.2, 1.0]
>>> N_data = len(data_list)
>>> data_mat = np.array(data_list).reshape(N_data,1)
>>> algo = rpt.Binseg(model="l2",min_size=1,jump=1).fit(data_mat)
>>> computed_break_dict={}
>>> for n_bkps in range(N_data):
...     try:
...         result = algo.predict(n_bkps=n_bkps)
...     except Exception as e:
...         result = e
...     computed_break_dict[n_bkps] = result
>>> computed_break_dict
{0: [4], 1: [4], 2: BadSegmentationParameters(), 3: BadSegmentationParameters()}
>>> expected_break_dict = {
...     0:[4],
...     1:[2,4],
...     2:[2,3,4],
...     3:[1,2,3,4]
...     }

deepcharles · 2022-03-25T07:46:07Z

Hi,

Thanks for raising this issue. Here, the min_size (minimum segment length) is not handled correctly, resulting in bad segmentations without raising a BadSegmentation error. We will correct it.

oboulant · 2022-03-25T09:36:36Z

Hi @tdhock ,

Indeed, and we have to address it !

But, just so you know, if you use a min_size >= 2, then you have the expected results for the given min_size.

Here, the misbehaviour is due to the fact that, from the user perspective you think you have a min_size to 1, but in really it is overwritten by the minimum value for a given cost (here L2).
See here for the minimum value for L2 cost.
See here for the place where the set value (you set value to 1) is overwritten to 2.

Hope it helps !

tdhock · 2022-03-25T18:41:04Z

ok, but for the l2 cost (square loss change in normal mean, constant variance) the min_size can be 1, so I still expect a valid segmentation should be returned (not an exception).

oboulant · 2022-05-17T07:23:31Z

Closing the issue since it has been addressed in #255

tdhock changed the title ~~Binseg incorrect for small data~~ Binseg incorrect for small data, l2 loss and min_size=1 Mar 31, 2022

deepcharles mentioned this issue Apr 29, 2022

fix(CostL2): set min_size to 1 #255

Merged

oboulant closed this as completed May 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Binseg incorrect for small data, l2 loss and min_size=1 #242

Binseg incorrect for small data, l2 loss and min_size=1 #242

tdhock commented Mar 23, 2022

deepcharles commented Mar 25, 2022

oboulant commented Mar 25, 2022

tdhock commented Mar 25, 2022

oboulant commented May 17, 2022

Binseg incorrect for small data, l2 loss and min_size=1 #242

Binseg incorrect for small data, l2 loss and min_size=1 #242

Comments

tdhock commented Mar 23, 2022

deepcharles commented Mar 25, 2022

oboulant commented Mar 25, 2022

tdhock commented Mar 25, 2022

oboulant commented May 17, 2022