-
-
Notifications
You must be signed in to change notification settings - Fork 560
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Overflow with large numbers in Nelson Aalen fitter #1585
Comments
Hi @shilet, are you able to provide an example of this failing? AFAIK, integers in Python shouldn't overflow. |
Hi Cameron,
I was just using the NelsonAalenFitter(nelson_aalen_smoothing=False), and
got the following error:
<lifelines.NelsonAalenFitter:"NA_estimate", fitted with 5.77905e+06 total
observations, 0 right-censored observations>
\lib\site-packages\lifelines\fitters\nelson_aalen_fitter.py:176:
RuntimeWarning: invalid value encountered in sqrt
df[ci_labels[0]] = cum_hazard_ * np.exp(-z * np.sqrt(cumulative_sq_) /
np.where(cum_hazard_ == 0, 1, cum_hazard_))
The confidence interval for the cumulative hazard contains weird (way too
large) numbers, and also NaN/Inf.
When using a smaller dataset I didn't have this problem.
The error disappears after I changed this: in utils/__init__.py:
population = events["at_risk"] - entrances
population = population.astype(float)
So that's why I guess it is some sort of overflow issue (but my python
knowledge is limited...)
Hopefully this helps
Best regards,
Sheila
…On Sat, 30 Dec 2023 at 22:02, Cameron Davidson-Pilon < ***@***.***> wrote:
Hi @shilet <https://github.com/shilet>, are you able to provide an
example of this failing? AFAIK, integers in Python shouldn't overflow.
—
Reply to this email directly, view it on GitHub
<#1585 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AIK4VBJWXOOZZYEX4HU327LYMB6PDAVCNFSM6AAAAABAXXWRH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZSGYYDMNJTG4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
hm, I'm not able to reproduce it. Can you share the dataset (unlikely)? Can you tell me more about your dataset (any negative values, any late entry being used)? |
Hi,
I can reproduce with the following:
y = np.random.randint(1, 1000, 100000000)
naf = NelsonAalenFitter(nelson_aalen_smoothing=False)
naf.fit(y ,event_observed=None, timeline=range(0, int(y.max())))
…On Sun, 31 Dec 2023 at 18:49, Cameron Davidson-Pilon < ***@***.***> wrote:
hm, I'm not able to reproduce it. Can you share the dataset (unlikely)?
Can you tell me more about your dataset (any negative values, any late
entry being used)?
—
Reply to this email directly, view it on GitHub
<#1585 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AIK4VBKDV3JQ2TEFBYJ3MWTYMGQSPAVCNFSM6AAAAABAXXWRH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZTGAYDIOJRGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks, I was able to reproduce it with your example. I believe the problem is integer overflow, as you predicted. It's correct that Python integers don't overflow, but when they are inside a Pandas series, they are def _variance_f_discrete(self, population, deaths):
return (population - deaths) * deaths / population ** 3 to def _variance_f_discrete(self, population, deaths):
return (1 - deaths / population) * (deaths / population) * (1. / population) will be fixed in the next release |
Hi,
Happily you could reproduce it.
Is there a reason you didn't repair it by converting population to float
(population = population.astype(float))?
…On Mon, 1 Jan 2024 at 22:15, Cameron Davidson-Pilon < ***@***.***> wrote:
Thanks, I was able to reproduce it with your example. I believe the
problem *is* integer overflow, as you predicted. It's correct that
*Python* integers don't overflow, but when they are inside a Pandas
series, they are int64, which can overflow. The population**3 term in the
calculation was creating a negative value where there shouldn't have been.
I've fixed it by avoiding that large exponent:
def _variance_f_discrete(self, population, deaths):
return (population - deaths) * deaths / population ** 3
to
def _variance_f_discrete(self, population, deaths):
return (1 - deaths / population) * (deaths / population) * (1. / population)
—
Reply to this email directly, view it on GitHub
<#1585 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AIK4VBKA6IBUG3QOD4YXF33YMMRPFAVCNFSM6AAAAABAXXWRH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZTGQ4DEOJVHE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
floats (or float64, in pandas) have the same problem: they'll overflow eventually. However, they'll overflow at a higher number. So why not use float64? While the upperbound of a float64 is higher than a int64, a float64s precision is severely reduced for larger numbers, so your final values aren't precise and could look quite silly. The implemented solution avoids computing these large numbers, so we can keep the precision of int64s while avoiding overflows. |
When determining the confidence interval for the Nelson Aalen Fitter, the following formula is used:
def _variance_f_discrete(self, population, deaths):
return (population - deaths) * deaths / population ** 3
The problem is that population (population = events["at_risk"] - entrances) is an integer, this should be changed to float to prevent overflow. Otherwise the resulting confidence interval is NaN.
The text was updated successfully, but these errors were encountered: