Add metric for general MSAS statistics #639

Open
npatki opened this issue Oct 9, 2024 · 0 comments · Fixed by #649
Labels: data:sequential (Related to timeseries datasets), feature request (Request for a new feature)

npatki commented Oct 9, 2024

Problem Description

In this paper, we introduced a new methodology for calculating multi-sequence metrics called MSAS. We should add the MSAS-related metrics to SDMetrics so that users with sequential data can use them for evaluation.

Expected behavior

Add a metric called StatisticMSAS that performs the MSAS algorithm for a given statistic.

Data compatibility: 1 ID column (representing the sequence key), and 1 continuous column (datetime or numerical)

Parameters:

  • (required) real_data: A tuple of 2 pandas.Series objects. The first represents the sequence key of the real data and the second represents a continuous column of data.
  • (required) synthetic_data: A tuple of 2 pandas.Series objects. The first represents the sequence key of the synthetic data and the second represents a continuous column of data.
  • (optional) statistic: A string naming the statistic function to use when computing MSAS. One of:
    • (default) 'mean': The arithmetic mean
    • 'median': The median value
    • 'std': The standard deviation
    • 'min': The min value
    • 'max': The max value
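
One way the statistic names above could be resolved is through a small lookup table. The sketch below is an illustration only and uses Python's standard library rather than SDMetrics internals; SUPPORTED_STATISTICS and resolve_statistic are hypothetical names. The issue does not say whether 'std' means the sample or the population standard deviation, so the sample version is assumed here.

```python
import statistics

# Hypothetical lookup table: maps each proposed `statistic` string to a callable.
# Assumption: 'std' is the sample standard deviation (the issue does not specify),
# which needs at least two values per sequence.
SUPPORTED_STATISTICS = {
    'mean': statistics.fmean,
    'median': statistics.median,
    'std': statistics.stdev,
    'min': min,
    'max': max,
}


def resolve_statistic(statistic='mean'):
    """Return the callable for a statistic name, or raise ValueError."""
    try:
        return SUPPORTED_STATISTICS[statistic]
    except KeyError:
        supported = ', '.join(sorted(SUPPORTED_STATISTICS))
        raise ValueError(
            f'Invalid statistic {statistic!r}. Supported values: {supported}.'
        ) from None


# Toy values standing in for one sequence of heart-rate readings.
heart_rates = [60.0, 62.0, 64.0]
print(resolve_statistic('median')(heart_rates))
print(resolve_statistic()(heart_rates))  # falls back to the default, 'mean'
```

Rejecting unknown names up front keeps a typo such as 'avg' from silently producing a score for the wrong statistic.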

Output: A score in the range [0, 1], where 0 is the worst and 1 is the best

from sdmetrics.column_pairs import StatisticMSAS

score = StatisticMSAS.compute(
    real_data=(real_table['patient_id'], real_table['heart_rate']),
    synthetic_data=(synthetic_table['patient_id'], synthetic_table['heart_rate']),
    statistic='std'
)

How does it work? The sequence key determines which continuous values belong to which sequence. This metric computes a statistic for all sequences in the real and synthetic data, and then compares those distributions.

  1. Calculate the statistic value of each sequence in the real data (call this distribution D_r)
  2. Calculate the statistic value of each sequence in the synthetic data (call this distribution D_s)
  3. Apply the KSComplement metric to D_r and D_s to measure how similar the two distributions are. Return this score.
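
The three steps above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the SDMetrics implementation: plain lists stand in for the pandas.Series inputs, the function names are hypothetical, and KSComplement is assumed to be one minus the two-sample Kolmogorov-Smirnov statistic.

```python
import statistics
from bisect import bisect_right
from collections import defaultdict


def per_sequence_statistic(sequence_keys, values, statistic=statistics.fmean):
    """Steps 1 and 2: group values by sequence key, then apply `statistic` per group."""
    groups = defaultdict(list)
    for key, value in zip(sequence_keys, values):
        groups[key].append(value)
    return [statistic(group) for group in groups.values()]


def ks_complement(sample_a, sample_b):
    """Assumed KSComplement: 1 minus the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for point in a + b:
        # Empirical CDF at `point`: fraction of each sample that is <= point.
        cdf_a = bisect_right(a, point) / len(a)
        cdf_b = bisect_right(b, point) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return 1.0 - max_gap


def msas_sketch(real_data, synthetic_data, statistic=statistics.fmean):
    """Step 3: compare the per-sequence statistic distributions D_r and D_s."""
    d_r = per_sequence_statistic(*real_data, statistic=statistic)
    d_s = per_sequence_statistic(*synthetic_data, statistic=statistic)
    return ks_complement(d_r, d_s)


# Toy data: two sequences ('a' and 'b') in each table.
real = (['a', 'a', 'b', 'b'], [1.0, 3.0, 5.0, 7.0])        # sequence means: 2, 6
synthetic = (['x', 'x', 'y', 'y'], [1.0, 3.0, 9.0, 11.0])  # sequence means: 2, 10
print(msas_sketch(real, synthetic))
```

Note that the score depends only on the distribution of per-sequence statistics, so the sequence keys in the real and synthetic data do not need to match.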

npatki added the feature request and data:sequential labels on Oct 9, 2024