Overflow with large numbers in Nelson Aalen fitter #1585

shilet · 2023-12-16T18:06:12Z

When determining the confidence interval for the Nelson Aalen Fitter, the following formula is used:

    def _variance_f_discrete(self, population, deaths):
        return (population - deaths) * deaths / population ** 3

The problem is that population (population = events["at_risk"] - entrances) is an integer. It should be converted to float to prevent overflow; otherwise the resulting confidence interval is NaN.

CamDavidsonPilon · 2023-12-30T21:02:31Z

Hi @shilet, are you able to provide an example of this failing? AFAIK, integers in Python shouldn't overflow.

shilet · 2023-12-31T10:41:36Z

Hi Cameron,

I was just using NelsonAalenFitter(nelson_aalen_smoothing=False) and got the following error:

    <lifelines.NelsonAalenFitter:"NA_estimate", fitted with 5.77905e+06 total observations, 0 right-censored observations>
    \lib\site-packages\lifelines\fitters\nelson_aalen_fitter.py:176: RuntimeWarning: invalid value encountered in sqrt
      df[ci_labels[0]] = cum_hazard_ * np.exp(-z * np.sqrt(cumulative_sq_) / np.where(cum_hazard_ == 0, 1, cum_hazard_))

The confidence interval for the cumulative hazard contains weird (way too large) numbers, and also NaN/Inf. When using a smaller dataset I didn't have this problem.

The error disappears after I changed this in utils/__init__.py:

    population = events["at_risk"] - entrances
    population = population.astype(float)

So that's why I guess it is some sort of overflow issue (but my Python knowledge is limited...).

Hopefully this helps.

Best regards,
Sheila

CamDavidsonPilon · 2023-12-31T17:49:16Z

hm, I'm not able to reproduce it. Can you share the dataset (unlikely)? Can you tell me more about your dataset (any negative values, any late entry being used)?

shilet · 2023-12-31T20:13:00Z

Hi, I can reproduce with the following:

    import numpy as np
    from lifelines import NelsonAalenFitter

    y = np.random.randint(1, 1000, 100000000)
    naf = NelsonAalenFitter(nelson_aalen_smoothing=False)
    naf.fit(y, event_observed=None, timeline=range(0, int(y.max())))

CamDavidsonPilon · 2024-01-01T21:15:19Z

Thanks, I was able to reproduce it with your example. I believe the problem is integer overflow, as you predicted. It's correct that Python integers don't overflow, but inside a pandas Series they are stored as int64, which can overflow. The population ** 3 term in the calculation was producing a negative value where there shouldn't have been one. I've fixed it by avoiding that large exponent, going from:

    def _variance_f_discrete(self, population, deaths):
        return (population - deaths) * deaths / population ** 3

to

    def _variance_f_discrete(self, population, deaths):
        return (1 - deaths / population) * (deaths / population) * (1. / population)

This will be fixed in the next release.
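
For readers who want to see the wrap-around without fitting 100 million observations, here is a minimal sketch in plain NumPy. The function names variance_original and variance_rearranged and the sample counts are made up for illustration; only the two formulas come from this thread. It assumes the counts arrive as int64 arrays, as described above.

```python
import numpy as np

def variance_original(population, deaths):
    # Formula before the fix: population ** 3 is computed in the integer dtype.
    return (population - deaths) * deaths / population ** 3

def variance_rearranged(population, deaths):
    # Formula after the fix: only ratios are formed, so no huge intermediate appears.
    return (1 - deaths / population) * (deaths / population) * (1.0 / population)

# Hypothetical counts: a small risk set and a large one.
population = np.array([1_000, 2_500_000], dtype=np.int64)
deaths = np.array([10, 2_500], dtype=np.int64)

# The second cube exceeds the int64 maximum and silently wraps to a negative value.
cubed = population ** 3

before = variance_original(population, deaths)
after = variance_rearranged(population, deaths)

# Silence the "invalid value encountered in sqrt" warning for the demo.
with np.errstate(invalid="ignore"):
    std_before = np.sqrt(before)  # NaN wherever the variance came out negative

print(cubed)
print(before, after)
print(std_before)
```

With a population of 2,500,000 the cube is about 1.56e19, which is above the int64 maximum of roughly 9.22e18, so the first formula divides by a wrapped, negative number. For the small risk set the two formulas agree.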

shilet · 2024-01-02T08:39:47Z

Hi, glad you could reproduce it. Is there a reason you didn't fix it by converting population to float (population = population.astype(float))?

CamDavidsonPilon · 2024-01-02T16:11:50Z

Floats (or float64, in pandas) have the same problem: they'll overflow eventually, just at a much higher number. So why not use float64? While the upper bound of a float64 is higher than that of an int64, a float64's precision is severely reduced for larger numbers, so your final values aren't precise and could look quite silly. The implemented solution avoids computing these large numbers, so we can keep the precision of int64s while avoiding overflows.
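
The trade-off described here can be checked directly. The following is a rough sketch, not lifelines code; the variable names and the sample population are assumptions chosen for illustration. It relies on the fact that float64 has a 53-bit significand, so integers above 2**53 are no longer all representable.

```python
import numpy as np

# Upper bounds of the two dtypes.
int64_max = np.iinfo(np.int64).max       # 2**63 - 1
float64_max = np.finfo(np.float64).max   # far larger than int64_max

# Above 2**53, neighbouring integers collapse onto the same float64 value.
big = 2**53
same = float(big) == float(big + 1)

# Casting to float first (the workaround suggested in this thread) avoids the
# wrap-around: the cube stays positive, at the cost of float64 rounding.
population = np.array([2_500_000], dtype=np.int64)
cube_float = population.astype(float) ** 3

print(int64_max, float64_max)
print(same)
print(cube_float)
```

So the astype(float) route does remove the negative values, while the rearranged formula sidesteps the large intermediate altogether.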

CamDavidsonPilon closed this as completed Jan 1, 2024

CamDavidsonPilon added a commit that referenced this issue Jan 1, 2024

fix int64 overflow in NAF #1585

2ea9ec8

CamDavidsonPilon mentioned this issue Jan 1, 2024

0.28.0 #1590

Merged
