The ColumnSynthesizer should follow the sdtypes in the metadata (not the data's dtypes) #249

Closed
npatki opened this issue Jun 5, 2023 · 0 comments · Fixed by #374
Comments

npatki commented Jun 5, 2023

Environment Details

  • SDGym version: 0.6.0 (latest)

What is expected

The ColumnSynthesizer is expected to independently model each column.

  • For numerical or datetime sdtypes, it should learn a univariate GMM (Gaussian mixture model) during fit. Then during sample, it can draw new values from the fitted model.
  • For categorical or boolean sdtypes, it should learn the frequency of each category. Then during sample, it can create data using those frequencies as weights.
  • For other sdtypes (such as id, pii, etc.), it can simply use the RegexGenerator or AnonymizedFaker to generate values from scratch (no learning is expected).

How does this synthesizer know which sdtype each column has? It should use the provided metadata as the source of truth.
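The expected behavior can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not SDGym's implementation: the `metadata` layout (column name mapped to sdtype), the `fit`/`sample` helpers, and the example columns are all assumptions made for this sketch. A single Gaussian stands in for the univariate GMM, and a simple counter-based string stands in for RegexGenerator or AnonymizedFaker.

```python
import random
import statistics
from collections import Counter

# Hypothetical metadata layout (an assumption): column name -> sdtype.
metadata = {
    "age": "numerical",
    "zip_code": "categorical",  # stored as integers, but categorical per the metadata
    "user_id": "id",
}

data = {
    "age": [23.0, 35.0, 41.0, 29.0, 52.0],
    "zip_code": [2139, 2139, 10001, 94105, 2139],
    "user_id": ["u0", "u1", "u2", "u3", "u4"],
}


def fit_column(values, sdtype):
    """Pick the per-column model from the metadata sdtype, never from the values' type."""
    if sdtype in ("numerical", "datetime"):
        # Stand-in for the univariate GMM: a single Gaussian.
        # Datetime values are assumed to be converted to numeric timestamps beforehand.
        return ("gaussian", statistics.mean(values), statistics.pstdev(values))
    if sdtype in ("categorical", "boolean"):
        counts = Counter(values)
        return ("frequencies", list(counts.keys()), list(counts.values()))
    # id, pii, etc.: nothing is learned from the data.
    return ("generated", None, None)


def sample_column(model, num_rows, rng):
    """Draw num_rows values from a fitted per-column model."""
    kind, first, second = model
    if kind == "gaussian":
        return [rng.gauss(first, second) for _ in range(num_rows)]
    if kind == "frequencies":
        # Learned category counts are used as sampling weights.
        return rng.choices(first, weights=second, k=num_rows)
    # Placeholder generator standing in for RegexGenerator / AnonymizedFaker.
    return [f"id_{i}" for i in range(num_rows)]


def fit(data, metadata):
    """Fit one independent model per column listed in the metadata."""
    return {name: fit_column(data[name], sdtype) for name, sdtype in metadata.items()}


def sample(models, num_rows, seed=0):
    """Sample every column independently from its fitted model."""
    rng = random.Random(seed)
    return {name: sample_column(model, num_rows, rng) for name, model in models.items()}


models = fit(data, metadata)
synthetic = sample(models, num_rows=4)
```

Note that `zip_code` holds integers but is sampled from learned frequencies, because the metadata marks it as categorical.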

What is actually observed

Similar to the UniformSynthesizer (see #248), this synthesizer lets the RDT HyperTransformer infer each column's sdtype from the data.

It should be referencing the metadata, since the metadata is the source of truth.
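To illustrate why data-based detection can disagree with the metadata, here is a small sketch. The `infer_sdtype_from_values` function is a deliberately naive, hypothetical stand-in for dtype-based detection; it is not the HyperTransformer's actual logic, and the columns are made up for the example.

```python
def infer_sdtype_from_values(values):
    """Naive, hypothetical dtype-based guess (not RDT's actual detection logic)."""
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"


# Hypothetical metadata and data, assumed for illustration only.
metadata = {"zip_code": "categorical", "user_id": "id", "age": "numerical"}
data = {
    "zip_code": [2139, 10001, 94105],
    "user_id": ["u0", "u1", "u2"],
    "age": [23, 35, 41],
}

# Columns where the guess from the data differs from the metadata:
# column name -> (inferred sdtype, metadata sdtype)
mismatches = {
    name: (infer_sdtype_from_values(data[name]), sdtype)
    for name, sdtype in metadata.items()
    if infer_sdtype_from_values(data[name]) != sdtype
}
```

In this sketch, an integer-coded `zip_code` would be modeled as numerical and `user_id` as categorical, which is the kind of mismatch the metadata is meant to prevent.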

npatki added the bug label Jun 5, 2023
npatki changed the title from "The IndependentSynthesizer should follow the sdtypes in the metadata (not the data's dtypes)" to "The ColumnSynthesizer should follow the sdtypes in the metadata (not the data's dtypes)" Jan 8, 2025
fealho self-assigned this Jan 19, 2025
fealho added this to the 0.9.2 milestone Jan 21, 2025