Order of categories influences chi_square statistic #2189

Open
ViacheslavP opened this issue Dec 2, 2024 · 1 comment
Comments

@ViacheslavP

Steps to reproduce

  1. Create a simple dataset with a 2:1 ratio
data.sql

For some reason, I was unable to run Soda with fewer rows

create table Employee (
                        id int primary key,
                        name varchar(255)
);

insert into Employee (id, name) values (1, 'Alice');
insert into Employee (id, name) values (2, 'Bob');
insert into Employee (id, name) values (3, 'Alice');

insert into Employee (id, name) values (11, 'Alice');
insert into Employee (id, name) values (12, 'Bob');
insert into Employee (id, name) values (13, 'Alice');

insert into Employee (id, name) values (21, 'Alice');
insert into Employee (id, name) values (22, 'Bob');
insert into Employee (id, name) values (23, 'Alice');

insert into Employee (id, name) values (31, 'Alice');
insert into Employee (id, name) values (32, 'Bob');
insert into Employee (id, name) values (33, 'Alice');

insert into Employee (id, name) values (41, 'Alice');
insert into Employee (id, name) values (42, 'Bob');
insert into Employee (id, name) values (43, 'Alice');

insert into Employee (id, name) values (51, 'Alice');
insert into Employee (id, name) values (52, 'Bob');
insert into Employee (id, name) values (53, 'Alice');
  2. Run the following check
checks for Employee:
  - row_count = 18

  - distribution_difference(name) < 0.05:
      method: chi_square
      distribution reference file: ./distribution.yaml

with distribution.yaml:

dataset: employee
column: name
distribution_type: categorical
distribution_reference:
  weights:
  - 0.7
  - 0.3
  bins:
  - Alice
  - Bob

Expected behavior

The chi_square statistic is close to zero, since there are 12 Alice rows and 6 Bob rows (a 2:1 split, close to the 0.7/0.3 reference weights)
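
For reference, here is a minimal sketch of the textbook Pearson chi-square statistic for these counts. It only illustrates the arithmetic behind the expectation above; the function name `chi_square_statistic` is made up for this example and is not Soda's implementation.

```python
def chi_square_statistic(observed, weights):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    total = sum(observed)
    # Expected counts are the reference weights scaled to the sample size.
    expected = [w * total for w in weights]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Counts from data.sql: 12 rows named Alice, 6 rows named Bob.
observed = [12, 6]
# Reference weights from distribution.yaml, in the same order as the bins.
weights = [0.7, 0.3]

stat = chi_square_statistic(observed, weights)
print(round(stat, 4))  # 0.0952
```

With expected counts of 12.6 and 5.4, the statistic comes out at roughly 0.095, well below the ~0.6 reported.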

Actual behavior

The statistic value is high (~0.6)

Misc

When I swap the order of the weights (0.3, 0.7) but leave the bins unchanged, the statistic is OK
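
One possible explanation, not confirmed in this issue, is that the observed counts and the reference weights get paired by position after one of them has been reordered. The sketch below shows how much the statistic depends on that pairing, and how keying both sides by category label avoids the problem. The helper `chi_square_by_label` is hypothetical and not part of Soda.

```python
def chi_square_by_label(observed_counts, reference_weights):
    """Pearson chi-square with observed and expected matched by label."""
    total = sum(observed_counts.values())
    stat = 0.0
    for label, weight in reference_weights.items():
        expected = weight * total
        # Look up the observed count by label, never by list position.
        stat += (observed_counts.get(label, 0) - expected) ** 2 / expected
    return stat


observed_counts = {"Alice": 12, "Bob": 6}

# Weights attached to the intended labels.
aligned = chi_square_by_label(observed_counts, {"Alice": 0.7, "Bob": 0.3})
# Same weights attached to the wrong labels, as a positional mix-up would do.
swapped = chi_square_by_label(observed_counts, {"Alice": 0.3, "Bob": 0.7})

print(f"aligned: {aligned:.3f}, swapped: {swapped:.3f}")
```

With the intended pairing the statistic is about 0.095; with the weights attached to the wrong labels it rises to about 11.5.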

@tools-soda

CLOUD-8980
