Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Overflow with large numbers in Nelson Aalen fitter #1585

Closed
shilet opened this issue Dec 16, 2023 · 7 comments
Closed

Overflow with large numbers in Nelson Aalen fitter #1585

shilet opened this issue Dec 16, 2023 · 7 comments

Comments

@shilet
Copy link

shilet commented Dec 16, 2023

When determining the confidence interval for the Nelson Aalen Fitter, the following formula is used:
def _variance_f_discrete(self, population, deaths):
return (population - deaths) * deaths / population ** 3

The problem is that population (population = events["at_risk"] - entrances) is an integer, this should be changed to float to prevent overflow. Otherwise the resulting confidence interval is NaN.

@CamDavidsonPilon
Copy link
Owner

Hi @shilet, are you able to provide an example of this failing? AFAIK, integers in Python shouldn't overflow.

@shilet
Copy link
Author

shilet commented Dec 31, 2023 via email

@CamDavidsonPilon
Copy link
Owner

hm, I'm not able to reproduce it. Can you share the dataset (unlikely)? Can you tell me more about your dataset (any negative values, any late entry being used)?

@shilet
Copy link
Author

shilet commented Dec 31, 2023 via email

@CamDavidsonPilon
Copy link
Owner

CamDavidsonPilon commented Jan 1, 2024

Thanks, I was able to reproduce it with your example. I believe the problem is integer overflow, as you predicted. It's correct that Python integers don't overflow, but when they are inside a Pandas series, they are int64, which can overflow. The population**3 term in the calculation was creating a negative value where there shouldn't have been. I've fixed it by avoiding that large exponent:

    def _variance_f_discrete(self, population, deaths):
        return (population - deaths) * deaths / population ** 3

to

    def _variance_f_discrete(self, population, deaths):
        return (1 - deaths / population) * (deaths / population) * (1. / population)

will be fixed in the next release

@shilet
Copy link
Author

shilet commented Jan 2, 2024 via email

@CamDavidsonPilon
Copy link
Owner

floats (or float64, in pandas) have the same problem: they'll overflow eventually. However, they'll overflow at a higher number. So why not use float64? While the upperbound of a float64 is higher than a int64, a float64s precision is severely reduced for larger numbers, so your final values aren't precise and could look quite silly. The implemented solution avoids computing these large numbers, so we can keep the precision of int64s while avoiding overflows.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants