Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Feature/sbachmei/mic 5163 exclude unwanted results #460

Conversation

stevebachmeier
Copy link
Contributor

Implement excluded results via model spec

Description

Changes and notes

This implements the ability to exclude results via the model spec

with a new stratification: excluded_categories key.

The work got a bit bigger than expected b/c we had to handle (1) typical
results that are actual Stratification objects (2) non-typical results that
are not Stratification objects but get handled on a case-by-case bases
(e.g. ylds which you cannot stratify by cause b/c you can have multiple per
person), and (3) results that have an extra unique category that are
not explicitly requested but come "for free" (e.g. all_causes for deaths
and other_causes for disability).

At a high level, what we're doing is modifying what is happening during
ResultsContext.gather_results():

  • We extract any excluded categories for a given result measure and
    drop them from the column mapping during each Stratification.call
  • Then, for each (pop_filter, stratfiications) tuple that gets registered
    from observations, we _filter_population to both query the defined
    pop_filter and also remove any rows that have NaNs in
    columns we are stratifying by.

Testing

tests pass and I ran some small simulations against the maternal
and child models.

@stevebachmeier stevebachmeier force-pushed the feature/sbachmei/MIC-5163-exclude-unwanted-results branch from aa74ad9 to a211a84 Compare August 7, 2024 22:15
@@ -57,6 +61,7 @@ def add_stratification(
name: str,
sources: List[str],
categories: List[str],
excluded_categories: Optional[List[str]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a list vs a dict now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean. This is a new thing and was never a dict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh you're referring to the zoom discussion about the broken doctest?

At the highest / model spec level, excluded_categories is a dict because you need to specify which exclusions for which set of results.

# model spec
...
configuration:
  stratification:
    excluded_categories:
      cause_of_death:
        - 'stillborn'
      disability:
        - 'all_causes'
        - 'mild_child_wating'

But this attribute is attached to a very specific stratification (e.g. 'cause_of_death') and so is just the list to exclude.

@@ -73,6 +73,7 @@ def register_stratification(
self,
name: str,
categories: List[str],
excluded_categories: Optional[List[str]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per comment above, do we just convert the dict from the config into a list so now these will always be a list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is at the stratfiication-level and so is always a list. The model spec also provides a list at the stratification level, it's just that each stratification is stored as the keys to a dictionary.

@@ -95,6 +95,7 @@ configuration by simply printing it.
sim = get_disease_model_simulation()

del sim.configuration['input_data']
del sim.configuration['stratification']['excluded_categories']
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This key was throwing errors in the expected print(sim.configuration) test below. The actual type of sim.configuration.stratification.excluded_categories is an emtpy LayeredConfigTree which I just couldn't get to work (I tried None, an empty string, literally nothing, and LayeredConfigTree() and none of them worked).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually was able to get this test to pass if I set the default to None (instead of an empty dict which then gets converted to a LayeredConfigTree) but that broke a bunch of tests in vph when trying to .update the value b/c you cannot update a ConfigNode.

@@ -86,6 +87,9 @@ def register_stratification(
Name of the of the column created by the stratification.
categories
List of string values that the mapper is allowed to output.
excluded_categories
List of mapped string values to be excluded from results processing.
If `None` (the default), will use exclusions as defined in th model spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo:

Suggested change
If `None` (the default), will use exclusions as defined in th model spec.
If `None` (the default), will use exclusions as defined in the configuration.

@@ -145,6 +150,7 @@ def register_binned_stratification(
------
None
"""
# TODO: implement excluded_categories like in `register_stratification`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be dealt with in a subsequent PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmudambi I wasn't sure how much of a priority this even is. It wasn't immediately clear to me how to handle excluded categories but then I'm also aware that we don't actually have an example of using register_binned_stratification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to the tests

@@ -171,6 +177,9 @@ def register_stratification(
Name of the of the column created by the stratification.
categories
List of string values that the mapper is allowed to output.
excluded_categories
List of mapped string values to be excluded from results processing.
If `None` (the default), will use exclusions as defined in th model spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If `None` (the default), will use exclusions as defined in th model spec.
If `None` (the default), will use exclusions as defined in the configuration

@@ -71,6 +76,9 @@ def add_stratification(
categorization.
categories
List of string values that the mapper is allowed to output.
excluded_categories
List of mapped string values to be excluded from results processing.
If `None` (the default), will use exclusions as defined in th model spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If `None` (the default), will use exclusions as defined in th model spec.
If `None` (the default), will use exclusions as defined in the configuration.

@@ -71,6 +72,8 @@ def get_results(self) -> Dict[str, pd.DataFrame]:

# noinspection PyAttributeOutsideInit
def setup(self, builder: "Builder") -> None:
self._results_context.setup(builder)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this previously not being called at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't, though there previously wasn't anything interesting there (only the logger). But now there's the excluded categories

@@ -138,6 +141,8 @@ def gather_results(self, lifecycle_phase: str, event: Event) -> None:
with 0.0.
"""
population = self._prepare_population(event)
if population.empty:
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's not strictly necessary but seemed silly to pass an empty population through gather_results

CategoricalDtype(categories=self.categories, ordered=True)
mapped_column = population[self.sources].apply(self.mapper, axis=1)

if mapped_column.isnull().any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this check needed? Shouldn't it be sufficient to do the check below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to handle the fact that NaNs are not equal and so the set of NaNs would blow up the error message. I'll try and think of a way to handle that though b/c it would be nice to check against other unknown categories as well before raising.

@@ -348,7 +348,7 @@ def get_component(self, name: str) -> Component:
return self._manager.get_component(name)

def get_components_by_type(
self, component_type: Union[type, Tuple[type, ...]]
self, component_type: Union[type, Tuple[type, ...], list[type]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make this Sequence[type]?

@@ -364,7 +364,7 @@ def get_components_by_type(
A list of components of type ``component_type``.

"""
return self._manager.get_components_by_type(component_type)
return self._manager.get_components_by_type(tuple(component_type))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the typing in the manager rather than converting it to a tuple here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does actually have to be a tuple though for isinstance. I do think it's cleaner to convert it to a tuple at the very end so I've done that

@@ -61,7 +61,7 @@ def add_stratification(
name: str,
sources: List[str],
categories: List[str],
excluded_categories: Optional[List[str]],
excluded_categories: List[str],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change this from None to an empty list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just seemed more consistent w/ the other list arguments defauling to [] instead of None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it back to None b/c it conveys different information than an empty list.

Comment on lines 78 to 86
# Reduce all nans to a single one
single_nan_list = (
[mapped_column[mapped_column.isna()].iat[0]]
if mapped_column.isna().any()
else []
)
unknown_categories = single_nan_list + [
cat for cat in unknown_categories if not pd.isna(cat)
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't this be:

unknown_categories = [
    cat for cat in unknown_categories if not pd.isna(cat)
]
if mapped_column.isna().any():
    unknown_categories.append(np.nan)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like pd.isna wasn't working as expected but now I'm not so sure. I'll try again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, yeah that works and is way better. I did stick w/ appending the type of NaN since that could potentially shed light on the issue but same basic logic

@stevebachmeier stevebachmeier force-pushed the feature/sbachmei/MIC-5163-exclude-unwanted-results branch from b910ae9 to c394aa7 Compare August 8, 2024 23:24

# Optimization: We store all the producers by pop_filter and stratifications
# so that we only have to apply them once each time we compute results.
for (pop_filter, stratifications), observations in self.observations[
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to stratififaton_names or *_columns or something

self,
population: pd.DataFrame,
pop_filter: str,
stratifications: Optional[tuple[str, ...]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: rename this to stratification_names as well.

if stratifications:
pop = pop.dropna(subset=list(stratifications))
# Drop all rows in the mapped_stratification columns that have NaN values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe explain that we are dropping these rows because being NaN means that the value was previously one of the excluded categories.

def _get_groups(
stratifications: Tuple[str, ...], filtered_pop: pd.DataFrame
self, stratifications: Tuple[str, ...], filtered_pop: pd.DataFrame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this can still be static? I don't see self being used.


def test_register_stratification(mocker):
def _silly_mapper():
return "foo"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the mapper return values in {"some-category", "some-other-category", "some-unwanted-category"}? I know we're not testing that in this test, but it is a bit confusing.

src/vivarium/framework/results/context.py Outdated Show resolved Hide resolved
`categories` is a set of values that the mapper is allowed to output. The
`mapper` is the method that transforms the source to the name column.
The method produces an output column by calling the mapper on the source
The `Stratification` class has six fields: `name`, `sources`, `mapper`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this PR only adds to what existed already, but if we're going to go through each attribute one by one, we should probably use a systematic docstring format i.e. what we use for say, Builder
https://github.com/ihmeuw/vivarium/blob/main/src/vivarium/framework/engine.py
I'd say "X class has # fields" is also unnecessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, and I'll do all that doc stuff in the final PRs for this epic. For now I only modified docs if they were categorically incorrect

@stevebachmeier stevebachmeier force-pushed the feature/sbachmei/MIC-5163-exclude-unwanted-results branch from fa615d2 to 8c71ae0 Compare August 12, 2024 20:06
@stevebachmeier stevebachmeier merged commit 2dfe6c8 into release-candidate-spring Aug 12, 2024
6 checks passed
@stevebachmeier stevebachmeier deleted the feature/sbachmei/MIC-5163-exclude-unwanted-results branch August 12, 2024 20:10
stevebachmeier added a commit that referenced this pull request Aug 13, 2024
* implement exclusions
* handle name collisions when stratifying
* allow component_type to be any sequence
* Add tests for stratification registration through interface
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants