-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Feature/sbachmei/mic 5163 exclude unwanted results #460
Feature/sbachmei/mic 5163 exclude unwanted results #460
Conversation
aa74ad9
to
a211a84
Compare
@@ -57,6 +61,7 @@ def add_stratification( | |||
name: str, | |||
sources: List[str], | |||
categories: List[str], | |||
excluded_categories: Optional[List[str]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a list vs a dict now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean. This is a new thing and was never a dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh you're referring to the zoom discussion about the broken doctest?
At the highest / model spec level, excluded_categories
is a dict because you need to specify which exclusions for which set of results.
# model spec
...
configuration:
stratification:
excluded_categories:
cause_of_death:
- 'stillborn'
disability:
- 'all_causes'
- 'mild_child_wating'
But this attribute is attached to a very specific stratification (e.g. 'cause_of_death') and so is just the list to exclude.
@@ -73,6 +73,7 @@ def register_stratification( | |||
self, | |||
name: str, | |||
categories: List[str], | |||
excluded_categories: Optional[List[str]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per comment above, do we just convert the dict from the config into a list so now these will always be a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is at the stratfiication-level and so is always a list. The model spec also provides a list at the stratification level, it's just that each stratification is stored as the keys to a dictionary.
@@ -95,6 +95,7 @@ configuration by simply printing it. | |||
sim = get_disease_model_simulation() | |||
|
|||
del sim.configuration['input_data'] | |||
del sim.configuration['stratification']['excluded_categories'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This key was throwing errors in the expected print(sim.configuration)
test below. The actual type of sim.configuration.stratification.excluded_categories
is an emtpy LayeredConfigTree
which I just couldn't get to work (I tried None, an empty string, literally nothing, and LayeredConfigTree() and none of them worked).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually was able to get this test to pass if I set the default to None (instead of an empty dict which then gets converted to a LayeredConfigTree) but that broke a bunch of tests in vph when trying to .update
the value b/c you cannot update a ConfigNode.
@@ -86,6 +87,9 @@ def register_stratification( | |||
Name of the of the column created by the stratification. | |||
categories | |||
List of string values that the mapper is allowed to output. | |||
excluded_categories | |||
List of mapped string values to be excluded from results processing. | |||
If `None` (the default), will use exclusions as defined in th model spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo:
If `None` (the default), will use exclusions as defined in th model spec. | |
If `None` (the default), will use exclusions as defined in the configuration. |
@@ -145,6 +150,7 @@ def register_binned_stratification( | |||
------ | |||
None | |||
""" | |||
# TODO: implement excluded_categories like in `register_stratification` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be dealt with in a subsequent PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rmudambi I wasn't sure how much of a priority this even is. It wasn't immediately clear to me how to handle excluded categories but then I'm also aware that we don't actually have an example of using register_binned_stratification
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this to the tests
@@ -171,6 +177,9 @@ def register_stratification( | |||
Name of the of the column created by the stratification. | |||
categories | |||
List of string values that the mapper is allowed to output. | |||
excluded_categories | |||
List of mapped string values to be excluded from results processing. | |||
If `None` (the default), will use exclusions as defined in th model spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If `None` (the default), will use exclusions as defined in th model spec. | |
If `None` (the default), will use exclusions as defined in the configuration |
@@ -71,6 +76,9 @@ def add_stratification( | |||
categorization. | |||
categories | |||
List of string values that the mapper is allowed to output. | |||
excluded_categories | |||
List of mapped string values to be excluded from results processing. | |||
If `None` (the default), will use exclusions as defined in th model spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If `None` (the default), will use exclusions as defined in th model spec. | |
If `None` (the default), will use exclusions as defined in the configuration. |
@@ -71,6 +72,8 @@ def get_results(self) -> Dict[str, pd.DataFrame]: | |||
|
|||
# noinspection PyAttributeOutsideInit | |||
def setup(self, builder: "Builder") -> None: | |||
self._results_context.setup(builder) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this previously not being called at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't, though there previously wasn't anything interesting there (only the logger). But now there's the excluded categories
@@ -138,6 +141,8 @@ def gather_results(self, lifecycle_phase: str, event: Event) -> None: | |||
with 0.0. | |||
""" | |||
population = self._prepare_population(event) | |||
if population.empty: | |||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's not strictly necessary but seemed silly to pass an empty population through gather_results
CategoricalDtype(categories=self.categories, ordered=True) | ||
mapped_column = population[self.sources].apply(self.mapper, axis=1) | ||
|
||
if mapped_column.isnull().any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this check needed? Shouldn't it be sufficient to do the check below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to handle the fact that NaNs are not equal and so the set of NaNs would blow up the error message. I'll try and think of a way to handle that though b/c it would be nice to check against other unknown categories as well before raising.
@@ -348,7 +348,7 @@ def get_component(self, name: str) -> Component: | |||
return self._manager.get_component(name) | |||
|
|||
def get_components_by_type( | |||
self, component_type: Union[type, Tuple[type, ...]] | |||
self, component_type: Union[type, Tuple[type, ...], list[type]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not make this Sequence[type]
?
@@ -364,7 +364,7 @@ def get_components_by_type( | |||
A list of components of type ``component_type``. | |||
|
|||
""" | |||
return self._manager.get_components_by_type(component_type) | |||
return self._manager.get_components_by_type(tuple(component_type)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the typing in the manager rather than converting it to a tuple here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does actually have to be a tuple though for isinstance
. I do think it's cleaner to convert it to a tuple at the very end so I've done that
@@ -61,7 +61,7 @@ def add_stratification( | |||
name: str, | |||
sources: List[str], | |||
categories: List[str], | |||
excluded_categories: Optional[List[str]], | |||
excluded_categories: List[str], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change this from None to an empty list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just seemed more consistent w/ the other list arguments defauling to [] instead of None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put it back to None b/c it conveys different information than an empty list.
# Reduce all nans to a single one | ||
single_nan_list = ( | ||
[mapped_column[mapped_column.isna()].iat[0]] | ||
if mapped_column.isna().any() | ||
else [] | ||
) | ||
unknown_categories = single_nan_list + [ | ||
cat for cat in unknown_categories if not pd.isna(cat) | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't this be:
unknown_categories = [
cat for cat in unknown_categories if not pd.isna(cat)
]
if mapped_column.isna().any():
unknown_categories.append(np.nan)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like pd.isna wasn't working as expected but now I'm not so sure. I'll try again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, yeah that works and is way better. I did stick w/ appending the type of NaN since that could potentially shed light on the issue but same basic logic
b910ae9
to
c394aa7
Compare
|
||
# Optimization: We store all the producers by pop_filter and stratifications | ||
# so that we only have to apply them once each time we compute results. | ||
for (pop_filter, stratifications), observations in self.observations[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to stratififaton_names
or *_columns
or something
self, | ||
population: pd.DataFrame, | ||
pop_filter: str, | ||
stratifications: Optional[tuple[str, ...]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: rename this to stratification_names
as well.
if stratifications: | ||
pop = pop.dropna(subset=list(stratifications)) | ||
# Drop all rows in the mapped_stratification columns that have NaN values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe explain that we are dropping these rows because being NaN means that the value was previously one of the excluded categories.
def _get_groups( | ||
stratifications: Tuple[str, ...], filtered_pop: pd.DataFrame | ||
self, stratifications: Tuple[str, ...], filtered_pop: pd.DataFrame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this can still be static? I don't see self
being used.
|
||
def test_register_stratification(mocker): | ||
def _silly_mapper(): | ||
return "foo" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the mapper return values in {"some-category", "some-other-category", "some-unwanted-category"}
? I know we're not testing that in this test, but it is a bit confusing.
`categories` is a set of values that the mapper is allowed to output. The | ||
`mapper` is the method that transforms the source to the name column. | ||
The method produces an output column by calling the mapper on the source | ||
The `Stratification` class has six fields: `name`, `sources`, `mapper`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this PR only adds to what existed already, but if we're going to go through each attribute one by one, we should probably use a systematic docstring format i.e. what we use for say, Builder
https://github.com/ihmeuw/vivarium/blob/main/src/vivarium/framework/engine.py
I'd say "X class has # fields" is also unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, and I'll do all that doc stuff in the final PRs for this epic. For now I only modified docs if they were categorically incorrect
…egories at debug level
…_binned_stratification
Co-authored-by: patricktnast <130876799+patricktnast@users.noreply.github.com>
fa615d2
to
8c71ae0
Compare
* implement exclusions * handle name collisions when stratifying * allow component_type to be any sequence * Add tests for stratification registration through interface
Implement excluded results via model spec
Description
Changes and notes
This implements the ability to exclude results via the model specwith a new
stratification: excluded_categories
key.The work got a bit bigger than expected b/c we had to handle (1) typical
results that are actual Stratification objects (2) non-typical results that
are not Stratification objects but get handled on a case-by-case bases
(e.g. ylds which you cannot stratify by cause b/c you can have multiple per
person), and (3) results that have an extra unique category that are
not explicitly requested but come "for free" (e.g. all_causes for deaths
and other_causes for disability).
At a high level, what we're doing is modifying what is happening during
ResultsContext.gather_results():
drop them from the column mapping during each Stratification.call
from observations, we _filter_population to both query the defined
pop_filter and also remove any rows that have NaNs in
columns we are stratifying by.
Testing
tests pass and I ran some small simulations against the maternal
and child models.