Order of categories influences chi_square statistic #2189

ViacheslavP · 2024-12-02T08:18:05Z

Steps to reproduce

Create a simple dataset with a 2:1 ratio of values

For some reason, I was unable to run Soda with fewer rows, so the dataset has 18.

data.sql

create table Employee (
                        id int primary key,
                        name varchar(255)
);

insert into Employee (id, name) values (1, 'Alice');
insert into Employee (id, name) values (2, 'Bob');
insert into Employee (id, name) values (3, 'Alice');

insert into Employee (id, name) values (11, 'Alice');
insert into Employee (id, name) values (12, 'Bob');
insert into Employee (id, name) values (13, 'Alice');

insert into Employee (id, name) values (21, 'Alice');
insert into Employee (id, name) values (22, 'Bob');
insert into Employee (id, name) values (23, 'Alice');

insert into Employee (id, name) values (31, 'Alice');
insert into Employee (id, name) values (32, 'Bob');
insert into Employee (id, name) values (33, 'Alice');

insert into Employee (id, name) values (41, 'Alice');
insert into Employee (id, name) values (42, 'Bob');
insert into Employee (id, name) values (43, 'Alice');

insert into Employee (id, name) values (51, 'Alice');
insert into Employee (id, name) values (52, 'Bob');
insert into Employee (id, name) values (53, 'Alice');

Run the following checks:

checks for Employee:
  - row_count = 18

  - distribution_difference(name) < 0.05:
      method: chi_square
      distribution reference file: ./distribution.yaml

with distribution.yaml:

dataset: employee
column: name
distribution_type: categorical
distribution_reference:
  weights:
  - 0.7
  - 0.3
  bins:
  - Alice
  - Bob

Expected behavior

The chi_square statistic should be close to zero, since there are 12 Alice rows and 6 Bob rows: observed proportions of about 0.67 and 0.33 against reference weights of 0.7 and 0.3.
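
For reference, the expected value can be worked out by hand. The sketch below is plain Python, not Soda's own implementation (which is not shown in this issue); the function name `chi_square_statistic` is made up for illustration, and it assumes the standard Pearson chi-square formula.

```python
def chi_square_statistic(observed, weights):
    """Pearson chi-square statistic for observed counts against reference weights."""
    total = sum(observed)
    # Expected count per category under the reference distribution.
    expected = [w * total for w in weights]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Counts from the Employee table above: 12 Alice rows, 6 Bob rows.
observed = [12, 6]
# Reference weights from distribution.yaml, in bin order (Alice, Bob).
weights = [0.7, 0.3]

statistic = chi_square_statistic(observed, weights)
print(round(statistic, 4))  # 0.0952
```

With expected counts of 12.6 and 5.4, the statistic is 0.36/12.6 + 0.36/5.4, roughly 0.095.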

Actual behavior

The statistic value is high (~0.6).

Misc

When I swap the order of the weights but leave the bins unchanged, the statistic is OK.
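
This would be consistent with counts and weights being paired in different orders somewhere, although nothing in this issue confirms the cause. As a plain-Python illustration only (not Soda code; `statistic_by_label` is a hypothetical helper), keying both sides by category label makes the result independent of the order in which the bins are listed:

```python
def statistic_by_label(counts, reference):
    """Pearson chi-square statistic with counts and weights matched by label."""
    total = sum(counts.values())
    return sum(
        (counts.get(label, 0) - weight * total) ** 2 / (weight * total)
        for label, weight in reference.items()
    )


counts = {"Alice": 12, "Bob": 6}

# The same reference distribution, listed in two different orders.
reference_a = dict(zip(["Alice", "Bob"], [0.7, 0.3]))
reference_b = dict(zip(["Bob", "Alice"], [0.3, 0.7]))

same_order = statistic_by_label(counts, reference_a)
other_order = statistic_by_label(counts, reference_b)

# Weights attached to the wrong labels, as a positional mix-up would do.
mismatched = statistic_by_label(counts, {"Alice": 0.3, "Bob": 0.7})

print(round(same_order, 4), round(other_order, 4), round(mismatched, 4))
# 0.0952 0.0952 11.5238
```

Neither value reproduces the reported ~0.6, so this only illustrates how sensitive the statistic is to the pairing of weights and bins.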


tools-soda · 2024-12-02T08:18:34Z

CLOUD-8980
