Workaround when creating a FlatDataList with nulled values. #23
As you noted, the Dataframe.getXColumn() method does not return only numeric/double values. I don't think this workaround should be placed here.
Noted... any tips on how I should try to "fix" this? I'm not implying that it is broken, but in some cases when I try to create and train a Text Classifier it fails because of this. For example, when using a StandardScaler for the NumericalScaler in the TrainingParameters of the Classifier: if I try to train the Classifier (fit() method) with what I suppose is a well-constructed Dataframe, it fails at that point, because the method Dataframe.getXColumn() returns a FlatDataList with null values in it. Also, if I simply ignore those values and don't add them to the FlatDataList at all, some methods like Descriptives.variance() fail because the size of the FlatDataList is 1 or less (and as a side note, my understanding of variance tells me this isn't a good idea either way). Edit: maybe in TypeInference.toDouble(Object v)? Like this:
for this:
Hey @jluis2k10, thanks for reporting this. :) How did you end up having null values in the Dataframe? Do you encode missing values as null, or are they the result of some other transformation done on the original data? If this needs patching, one potential solution is to change the Descriptives class to ignore null values. This should be a simple fix, but there are a few gotchas. For example, let's say you want to estimate the average of the following list: [1,2,null,6]. If you interpret nulls as missing values, then the average is (1+2+6)/3=3, not (1+2+6)/4. Moreover, please note that this is a policy change, so we need to be sure it is necessary. Please note that I'm not questioning whether the Descriptives class throws an exception if the list contains nulls. That is obvious. My question is how/why you ended up having nulls in your Dataframe in the first place. Perhaps the bug lives somewhere else. Could you open a ticket and provide a snippet that reproduces the problem? That could help us investigate the problem and discuss next steps. :)
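The averaging gotcha above can be illustrated with a minimal sketch in plain Java. The class and method names (`NullAwareMean`, `meanIgnoringNulls`, `meanNullsAsZero`) are hypothetical and are not part of the framework; a `List<Double>` stands in for a FlatDataList.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the framework's API.
final class NullAwareMean {

    // Policy A: nulls are missing values. They are skipped and do not count toward n.
    static double meanIgnoringNulls(List<Double> values) {
        double sum = 0.0;
        int n = 0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
                n++;
            }
        }
        if (n == 0) {
            throw new IllegalArgumentException("The list contains no non-null values.");
        }
        return sum / n;
    }

    // Policy B: nulls are replaced by 0.0, so they still count toward n.
    static double meanNullsAsZero(List<Double> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("The list is empty.");
        }
        double sum = 0.0;
        for (Double v : values) {
            sum += (v == null) ? 0.0 : v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(1.0, 2.0, null, 6.0);
        System.out.println(meanIgnoringNulls(data)); // prints 3.0  -> (1+2+6)/3
        System.out.println(meanNullsAsZero(data));   // prints 2.25 -> (1+2+0+6)/4
    }
}
```

The two policies give different answers for the same input, which is why silently picking one of them inside a statistics class would be a policy change rather than a bug fix.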
Yes, I'm aware of the inconvenience of simply ignoring the nulls; that's why I tried to convert them to 0.0. And I don't have nulls in the Dataframe: this happens when, in Dataframe.getXColumn(), the parameter … I don't have a neat and concise piece of code to show, but I will try to write some asap. And I'm sorry, but what about the edit on my previous comment? Is it not a good idea to just return 0.0 if null when converting to double?
This is not about convenience/inconvenience but rather about ensuring that the math/algorithms will not break. We can't convert nulls to zeros. The semantics would not be correct, it's a huge change that affects the entire framework, and it could potentially mute or cause other problems. Some classes of the framework are supposed to throw an exception if you have nulls or incorrect values stored in the Dataframe. In Machine Learning, data need to be preprocessed before being thrown into algorithms. For example, if your column is indeed non-empty in only 1 record, trying to perform standardization/scaling with an algorithm that estimates the variance (which requires more than 1 observation) should fail. Try writing a concise/simple piece of code that reproduces the problem and opening a ticket so that we can investigate further. :)
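As a rough sketch of the fail-fast behaviour described above, the following plain-Java variance rejects nulls and lists with fewer than 2 observations instead of guessing. `StrictVariance` and `sampleVariance` are hypothetical names, and the use of the unbiased sample variance (dividing by n-1) is an assumption made for illustration, not a statement about how the framework computes it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the framework's API.
final class StrictVariance {

    // Unbiased sample variance (divides by n-1); assumed here for illustration.
    static double sampleVariance(List<Double> values) {
        int n = values.size();
        if (n < 2) {
            // With a single observation the n-1 denominator would be zero.
            throw new IllegalArgumentException("Variance requires more than 1 observation.");
        }
        double sum = 0.0;
        for (Double v : values) {
            if (v == null) {
                // Fail loudly instead of silently substituting 0.0.
                throw new IllegalArgumentException("Null values must be handled before calling this method.");
            }
            sum += v;
        }
        double mean = sum / n;
        double squaredDiffs = 0.0;
        for (Double v : values) {
            double d = v - mean;
            squaredDiffs += d * d;
        }
        return squaredDiffs / (n - 1);
    }

    public static void main(String[] args) {
        System.out.println(sampleVariance(Arrays.asList(1.0, 2.0, 6.0))); // prints 7.0
    }
}
```

Under this design the caller decides during preprocessing whether nulls are dropped, imputed, or treated as an error; the statistics code only checks that the decision has been made.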
Sometimes, when creating a FlatDataList, null values are added to the list, and consequently the library fails with an exception when it tries to retrieve those values. For example, the method variance() in core.statistics.descriptivestatistics.Descriptives fails at line 293 (double v = it.next()) when flatDataCollection has some null values.
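The failure at `double v = it.next()` is consistent with standard Java auto-unboxing: assigning a null `Double` to a primitive `double` throws a `NullPointerException`. The sketch below reproduces that with plain `java.util` collections standing in for the FlatDataList; it assumes the library's iterator yields boxed `Double` values, and `UnboxingNullDemo` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for illustration only; not the library's code.
final class UnboxingNullDemo {

    // Returns true if iterating with primitive unboxing hits a null element.
    static boolean failsOnNull(List<Double> values) {
        Iterator<Double> it = values.iterator();
        double sum = 0.0;
        try {
            while (it.hasNext()) {
                double v = it.next(); // auto-unboxing: throws NullPointerException when the element is null
                sum += v;
            }
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsOnNull(Arrays.asList(1.0, 2.0, 6.0)));  // prints false
        System.out.println(failsOnNull(Arrays.asList(1.0, null, 6.0))); // prints true
    }
}
```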